Keywords: Spatial Error, Spatial Lag, Geographically Weighted
Regression, Global & Local Moran’s I, Akaiki Information Criterion,
Schwarz Criterion, Log Likelihood, Likelihood Ratio Test
Introduction
Philadelphia’s housing market has undergone significant
transformations in recent years, with rising property values creating
urgent concerns for urban planners, policymakers, and residents alike.
This upward trend has intensified issues of housing affordability,
disproportionately impacting low- and middle-income households. The
rapid increase in housing values has spurred gentrification, displacing
long-term residents and reshaping the socio-economic fabric of various
neighborhoods across the city. As researchers have observed,
gentrification often leads to social displacement, exacerbating
inequality as housing affordability declines (Freeman & Braconi,
2004; Atkinson, 2004). Understanding the factors driving these shifts in
property values is essential for addressing affordability challenges,
promoting neighborhood stability, and fostering equitable development in
Philadelphia (Lees, 2008).
In a previous study, Ordinary Least Squares (OLS) regression was
employed to explore the relationships between median house values, the
dependent variable, and several key socio-economic predictors. These
predictors included educational attainment, vacancy rates, the
proportion of detached single-family homes, and poverty levels. Each
variable was chosen for its established influence on housing markets and
its ability to shed light on underlying socio-economic conditions that
may shape property values. For instance, educational attainment is
positively associated with economic prosperity and housing demand, as
areas with higher educational levels often benefit from higher incomes
and greater investment Vacancy rates, on the other hand, are typically
linked to neighborhood decline and reduced property values, as vacant
properties signal economic distress, discourage investment, and may
contribute to higher crime rates (Mallach, 2018). The housing market
preference for single-family homes, which offer greater space and
privacy, is well-documented in the literature (Glaeser & Gyourko,
2018), while research consistently demonstrates a negative correlation
between poverty levels and housing values, with higher poverty rates
often linked to decreased demand and underinvestment in local
infrastructure (Galster, 2008).
Although OLS regression provides a foundational understanding of
these relationships, it has limitations when applied to spatial data, as
it assumes independence between observations and ignores potential
spatial dependencies. Housing values in one area are often influenced by
those in nearby areas, resulting in spatial autocorrelation that can
lead to biased or misleading results when using traditional OLS methods.
Spatial autocorrelation reflects Tobler’s First Law of Geography, which
states that “everything is related to everything else, but near things
are more related than distant things” (Tobler, 1970). When spatial
dependencies are ignored, as is the case in OLS regression, the
estimates may suffer from omitted variable bias, yielding inaccurate
conclusions about the relationships between variables (Anselin,
1988).
To address these limitations, this report applies spatial econometric
techniques, specifically spatial lag, spatial error, and geographically
weighted regression (GWR) models, to more accurately capture the spatial
dependencies affecting housing values in Philadelphia. The spatial lag
model incorporates the influence of neighboring values directly into the
regression, allowing for an understanding of how housing values in one
area may affect those in adjacent areas (LeSage & Pace, 2009). The
spatial error model accounts for spatial autocorrelation in the
residuals, isolating unobserved spatially correlated factors that may
influence housing values (Anselin, 1988). Lastly, the geographically
weighted regression (GWR) model offers a localized perspective, allowing
coefficients to vary by location and capturing the heterogeneity of
relationships across different neighborhoods (Brunsdon, Fotheringham,
& Charlton, 1996). By employing these spatial techniques, this study
aims to enhance the accuracy of the initial OLS findings and provide a
more comprehensive understanding of the socio-economic and spatial
factors influencing housing values. These insights will support more
effective policy interventions and urban development strategies aimed at
achieving equitable and sustainable growth in Philadelphia.
Methods
Spatial Autocorrelation
The First Law of Geography, proposed by Waldo Tobler, states that
“everything is related to everything else, but nearer things are
more related than distant things.” This law captures the principle
of spatial dependence, which underpins many spatial analyses, including
spatial regression models. One way to measure spatial dependence is
through Moran’s I, a statistic that quantifies spatial autocorrelation.
Moran’s I for a variable is calculated as follows:
\[
I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}} \cdot
\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (X_i - \bar{X})(X_j -
\bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\] where \(n\) is the number of
observations, \(X_i\) and \(X_j\) are the values of the variable at
locations \(i\) and \(j\), respectively, \(\bar{X}\) is the mean of the variable, and
\(w_{ij}\) is the spatial weight
between locations \(i\) and \(j\).
Here, we use a spatial weight matrix constructed using a “Queen”
contiguity method, which defines each unit’s neighbors based on shared
boundaries or vertices. This matrix remains consistent across the
analysis, although statisticians may use different weight matrices to
assess model sensitivity to neighbor definitions.
Testing the significance of spatial autocorrelation involves
evaluating whether the observed Moran’s I differs from what would be
expected under spatial randomness. In hypothesis testing, the null
hypothesis (\(H_0\)) is that there is
no spatial autocorrelation, while the alternative (\(H_a\)) posits the presence of spatial
autocorrelation. Using random permutations of data values across
locations, we generate a reference distribution of Moran’s I under the
null hypothesis and compare the observed statistic to this
distribution.
Beyond global spatial autocorrelation measured by Moran’s I, local
spatial autocorrelation identifies specific areas with clustering or
dispersion. The significance of local Moran’s I is tested similarly
using random permutations, and results can highlight statistically
significant clusters or outliers that global measures might miss.
Ordinary Least Squares Regression
Ordinary Least Squares (OLS) regression is a statistical technique
for estimating the relationship between a dependent variable and one or
more independent variables by minimizing the squared differences between
observed and predicted values. Key assumptions of OLS include linearity,
independence of observations, homoscedasticity (constant error
variance), normality of errors, and no multicollinearity among
predictors.
In our first assignment, we used OLS regression to assess how vacancy
rates, single-family housing percentage, educational attainment, and
poverty levels influence median house value. All predictors were
statistically significant, with vacancy rates and poverty levels
negatively affecting house values, while single-family housing and
educational attainment had positive effects. The model’s \(R^2\) was 0.66, explaining 66% of the
variance in house values. However, some predictors exhibited non-linear
patterns, and spatial autocorrelation suggested dependence among
observations, indicating that future models could benefit from spatial
regression techniques.
When data has a spatial component, the assumption that errors are
random and independent often doesn’t hold, as nearby observations may
exhibit similar error patterns. To test this, we can examine the spatial
autocorrelation of the residuals using Moran’s I, which quantifies the
degree of clustering in residuals. Another approach is to regress the
residuals on nearby residuals from neighboring areas, such as block
groups defined by the Queen matrix, to identify any spatial dependence.
These tests help determine if spatial autocorrelation is present,
indicating a need for spatial regression techniques.
In R, there are methods to test other key regression assumptions. For
homoscedasticity (constant error variance), which is related to the
independence of errors, we use the Breusch-Pagan Test,
the Koenker-Bassett Test, and the White
Test. The null hypothesis (\(H_0\)) is that errors are homoscedastic,
while the alternative hypothesis (\(H_a\)) is that errors exhibit
heteroscedasticity. For normality of errors, we can use the
Jarque-Bera test. The null hypothesis is that residuals
are normal, and the alternative hypothesis is that they are not normal.
We want to not be able to reject the null hypothesis (i.e., get a
p-value of 0.05 or higher).
Spatial Lag and Spatial Error Regression
In this report, we use R to run spatial lag and spatial error
regressions. Among the two methods, spatial lag regression assumes the
value of the dependent variable at one location is associated with the
values of that variable in nearby locations, where nearby is as defined
by the weights matrix W (rook, queen, within a certain distance of one
another). In short, it introduces a spatially lagged dependent variable
into the model, compared to our previous OLS regression. In our context,
the spatial lag model could be represented as the following:
\[
\text{LNMEDHVAL} = \rho W \cdot \text{LNMEDHVAL} + \beta_0 + \beta_1
\cdot \text{PCTVACANT} + \beta_2 \cdot \text{PCTSINGLES} + \beta_3 \cdot
\text{PCTBACHMOR} + \beta_4 \cdot \text{LNNBELPOV100} + \epsilon_i
\]
where \(\text{LNMEDHVAL}\) is the
log of median home value in location, \(\rho\) is the spatial lag coefficient that
measures the influence of neighboring areas, \(W\) is the spatial weights matrix (in this
case, the queenlist spatial weights), and \(W
\cdot \text{LNMEDHVAL}\) is the spatially lagged dependent
variable. The other terms are the same as in the OLS regression, where
\(\text{PCTVACANT}\), \(\text{PCTSINGLES}\), \(\text{PCTBACHMOR}\), and \(\text{LNNBELPOV}\) are the predictors,
\(\beta_1, \beta_2, \beta_3, \beta_4\)
are the coefficients, \(\beta_0\) is
the intercept term, and \(\epsilon_i\)
is the error term.
The spatial error model, on the other hand, assumes that the error
term is spatially autocorrelated, meaning that the residual in one
location is associated with residuals at nearby locations, where nearby
is also deifined by the weight matrix. The spatial error model could be
represented as the following:
\[
\text{LNMEDHVAL} = \beta_0 + \beta_1 \cdot \text{PCTVACANT} + \beta_2
\cdot \text{PCTSINGLES} + \beta_3 \cdot \text{PCTBACHMOR} + \beta_4
\cdot \text{LNNBELPOV100} + \lambda W \cdot \epsilon + u
\]
where \(\text{LNMEDHVAL}\) is the
log of median home value, \(\beta_0\)
is the intercept term, \(\text{PCTVACANT}\), \(\text{PCTSINGLES}\), \(\text{PCTBACHMOR}\), and \(\text{LNNBELPOV}\) are the predictors,
\(\beta_1, \beta_2, \beta_3, \beta_4\)
are the coefficients, as in OLS regression. \(\lambda\) is the spatial error coefficient
that measures the degree of spatial correlation in the error term, \(W\) is the spatial weights matrix, \(W \cdot \epsilon\) is the spatially lagged
error term, and \(u\) is the random
noise. To put it simply, we are regressing residuals on the nearest
neighbor residuals, thereby filtering the spatial information out of the
OLS residuals
Spatial error regression and spatial lag regression both require the
standard assumptions for OLS regression—such as linearity,
homoscedasticity, and normality of errors—except for the assumption of
spatial independence among observations. This adjustment allows the
model to account for spatial structure in either the dependent variable
(spatial lag) or the error term (spatial error). That said, their goal
is to account for spatial dependence in the data, aiming to reduce
spatial autocorrelation in the regression residuals. Both methods
minimize spatial patterns in residuals that could otherwise lead to
biased or inefficient estimates.
We compare the results of spatial lag and spatial error regression
with OLS to decide whether the spatial models perform better than OLS
based on a number of criteria: Akaike Information Criterion, Schwarz
Criterion, Log Likelihood, and Likelihood Ratio Test. The Akaike
Information Criterion (AIC) and Schwarz Criterion (SC
or BIC) are used to compare the goodness of fit of different
models. They are relative measures of the information that is lost when
a given model is used to describe reality and can be said to describe
the tradeoff between precision and complexity of the model. The lower
the AIC or SC, the better the model.
The Log-Likelihood is associated with the maximum
likelihood method of fitting a statistical model to the data and
estimating model parameters. Maximum likelihood picks the values of the
model parameters that make the data “more likely” than any other values
of the parameters would make them. Higher log-likelihood values indicate
a model that better explains the observed data.
The Likelihood Ratio Test is used to formally test
whether adding spatial dependence to a model (as in spatial lag or
spatial error models) significantly improves model fit compared to OLS.
For this test, the null hypothesis (\(H_0\)) state that the spatial model does
not provide a significantly better fit than OLS, while the alternative
hypothesis (\(H_a\)) states that the
spatial model provides a significantly better fit than OLS.
The decision rule is to reject the null hypothesis if the \(LR\) test statistic is significant (i.e.,
the p-value is below a chosen significance level, typically 0.05), and
conclude that the spatial model is a better fit than OLS. If not, OLS
may be adequate. It should be noted that the log likelihood method
and the likelihood ratio test should be used for comparing nested
models. Spatial lag and spatial error are not a special case of each
other – we cannot use the log likelihood ratio to compare them.
Alternatively, we can also compare the spatial models to OLS using
the Moran’s I statistic, which measures the spatial
autocorrelation of the residuals. Moran’s I ranges from -1 to 1, where
-1 indicates perfect dispersion, 0 indicates no spatial autocorrelation,
and 1 indicates perfect correlation. For our models, the goal is to
minimize spatial autocorrelation in the residuals. If the Moran’s I
statistic for the residuals of a spatial lag or spatial error model is
closer to zero than the Moran’s I for the OLS model, we can conclude
that the spatial model better captures spatial dependencies.
Geographically Weighted Regression
We will conduct our Geographically Weighted Regression (GWR) analyses
in R. GWR is a form of local regression that helps address spatial
heterogeneity in data, which is essential when analyzing spatial data
prone to Simpson’s Paradox — where a trend observed in aggregate data
can differ from trends in subsets. GWR allows us to examine
relationships at a local level rather than assuming they are uniform
across the study area. The general GWR equation can be expressed as:
\[
y_i = \beta_{i0} + \sum_{k=1}^{m} \beta_{ik}x_{ik} + \epsilon_i
\]
where: \(y_i\) is the dependent
variable at location \(i\), \(\beta_{i0}\) is the intercept for location
\(i\), allowing a unique baseline for
each location, \(\beta_{ik}\) is the
coefficient for the \(k\)-th predictor
at location \(i\), \(x_{ik}\) is the value of the \(k\)-th predictor variable at location \(i\), and \(\epsilon_i\) is the error term at location
\(i\).
In GWR, local regression is performed by fitting a regression model
at each observation point, using a subset of neighboring points, with
weights assigned based on their distance from the focal point. The
concept of bandwidth controls the number of neighbors included,
influencing how “local” each regression is. There are two types of
bandwidths: adaptive and fixed. A fixed bandwidth uses a constant
distance for all points, while an adaptive bandwidth varies, adjusting
to include a set number of nearby observations regardless of spatial
density. Here, we will use adaptive bandwidth, which is more appropriate
as it accounts for varying spatial densities, providing a flexible
analysis that better captures local relationships in areas with
different population distributions.
Although GWR allows for spatial variation in relationships, the
standard OLS assumptions (linearity, independence, homoscedasticity, and
normality) still apply. Multicollinearity is assessed using the
Condition Number, and high multicollinearity can cause issues in GWR,
leading to unstable estimates and clustering in parameter estimates. It
is also important to note that GWR does not provide p-values for
coefficients, as the model focuses on exploring spatial patterns rather
than testing global hypotheses.
Results
Global and Local Moran’s I
The Global Moran’s I analysis for the dependent variable, \(\text{LNMEDHVAL}\) (the natural log of
median house value), reveals a pronounced level of spatial
autocorrelation. With a Moran’s I statistic of 0.8, the results indicate
a strong positive spatial autocorrelation. This high value suggests that
areas with similar median house values tend to cluster geographically
within the study area. In other words, neighborhoods with either high or
low house values are more likely to be located near other neighborhoods
with similar values, rather than being randomly distributed across
space. This spatial clustering points to the presence of spatial
dependencies in housing values, possibly driven by neighborhood
characteristics, socio-economic factors, or other spatial processes
influencing property values across the region.
To validate the significance of this observed spatial
autocorrelation, a Monte Carlo permutation test was conducted using 1000
simulations. This approach involved randomly permuting the values of
\(\text{LNMEDHVAL}\) across spatial
units to generate a distribution of Moran’s I values under the null
hypothesis of no spatial autocorrelation. The results, visualized in a
histogram of permuted Moran’s I values, show that the observed Moran’s I
of 0.8 lies at the extreme end of this distribution, marked in red. With
an observed rank of 1000 (the highest rank in the distribution), the
observed Moran’s I value exceeded all permuted values, emphasizing the
extremity of the spatial clustering in the actual data.
The test result is further supported by a highly significant p-value
(\(p < 2 \times 10^{-16}\)). This
exceptionally low p-value strongly rejects the null hypothesis of no
spatial autocorrelation, confirming that the spatial arrangement of
\(\text{LNMEDHVAL}\) values is not
random. Instead, the observed clustering is statistically significant,
indicating that spatial processes are likely influencing the
distribution of housing values in the study area.
globalmoranMC<-moran.mc(regData$LNMEDHVAL, queenlist, nsim=999, alternative="two.sided")
globalmoranMC
##
## Monte-Carlo simulation of Moran I
##
## data: regData$LNMEDHVAL
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.8, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
This graph shows the results of the Monte Carlo permutation test to
assess the significance of the observed Moran’s I statistic.The
histogram shows the frequency of Moran’s I values generated by these
random permutations, with the observed Moran’s I value highlighted in
red. The observed statistic of 0.8 stands far to the right of the
permuted values, underscoring its extremity and significance. The high
observed rank (1000) indicates that none of the permuted values exceeded
the actual Moran’s I. This extremely low p-value provides strong
evidence against the null hypothesis of no spatial autocorrelation,
confirming that the spatial clustering observed in \(\text{LNMEDHVAL}\) is statistically
significant and unlikely to be due to random variation.
ggplot(data.frame(res = globalmoranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = globalmoranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Global Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

This graph offers a different perspective on the spatial structure of
\(\text{LNMEDHVAL}\) by displaying a
scatter plot of the variable’s values against their spatial lag (a
measure of neighboring values). The x-axis represents the logged median
house values \(\text{LNMEDHVAL}\),
while the y-axis displays the spatially lagged values of \(\text{LNMEDHVAL}\), computed based on a
queen contiguity spatial weights matrix that considers neighboring
spatial units. The positive slope of the red trend line in the scatter
plot indicates a positive spatial autocorrelation, where areas with
higher median house values are typically surrounded by other areas with
high values, and similarly, areas with lower values are near other
low-value areas. This linear relationship between a location’s \(\text{LNMEDHVAL}\) and the average values
in surrounding locations highlights the clustering of similar values and
supports the result from the Global Moran’s I statistic.
ggplot(data = data.frame(
LNMEDHVAL = regData$LNMEDHVAL,
spatial_lag = lag.listw(queenlist, regData$LNMEDHVAL)
), aes(x = LNMEDHVAL, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Global Moran's I Scatter Plot",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

In a Local Moran’s I significance map, areas with significant low
p-values refers to area where the spatial autocorrelation is
statistically significant, either hotspots of concentrated high values
or cold spots of low values. There areas are mostly found in the
northwestern and central parts of the city. As we may see, these areas
are then surrounded by areas with slightly higher p-values and then
areas with even higher p-values, until p becomes insignificant.
moranSig.plot <- function(df, listw, title) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
df$Pr.z <- local[, "Pr(z != E(Ii))"]
df$pval_category <- cut(df$Pr.z,
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("0.000 - 0.001", "0.001 - 0.010", "0.010 - 0.050", "0.050 - 1.000"),
include.lowest = TRUE)
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = pval_category), color = NA, alpha = 0.9) +
scale_fill_brewer(type = "div", palette = 6, name = "P-Value") +
labs(title = title) +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
}
moranSig.plot(localmoran, queenlist, 'Significance Map of Local Moran I')

The Cluster Map derived from the Local Moran’s I analysis classified
census tracts in Philadelphia into high-high, high-low, low-high,
low-low, or not significant, representing different types of spatial
relationships in house values within neighborhoods and their
surroundings.
High-High Clusters: Areas classified as high-high
are those where high values of median house prices are surrounded by
other high-value areas. These clusters are prominently located in the
northwestern parts of the city and some central regions, indicating
pockets of economic affluence. The high concentration of high-value
properties in these regions suggests established, affluent neighborhoods
where housing prices remain high due to demand and possibly the presence
of amenities or other attractive urban features.
Low-Low Clusters: Low-low clusters represent areas
where low property values are surrounded by other low-value areas,
highlighting economically disadvantaged zones. These clusters are
predominantly found in the southwestern and northeastern parts of the
city. The spatial clustering of low-value properties in these areas
suggests neighborhoods that may face economic challenges, possibly with
limited access to amenities or fewer investment opportunities. These
regions may require targeted policy interventions to address underlying
issues that contribute to the lower property values.
High-Low Clusters: High-low clusters are
transitional zones where high-value areas are adjacent to lower-value
areas. These areas, marked in light red on the map, are generally
scattered around the boundaries of high-value neighborhoods, such as in
sections of central and northwest Philadelphia. The proximity of
high-value properties to lower-value ones in these clusters can indicate
economic contrasts or areas experiencing gentrification, where property
values in traditionally lower-income neighborhoods may be increasing due
to spillover effects from nearby affluent areas.
Low-High Clusters: Low-high clusters, where
low-value properties are surrounded by higher-value areas, are less
common but appear in the northern and eastern parts of the city. These
clusters suggest isolated pockets of economic disadvantage within more
affluent areas. This pattern may indicate areas that are yet to benefit
from surrounding economic growth or might be experiencing challenges
that prevent them from aligning with the prosperity of neighboring
regions.
Not Significant Areas: Lastly, the not significant
areas, shaded in gray, indicate neighborhoods where the local Moran’s I
statistic was not significant. These regions are scattered throughout
the city, representing areas where house values do not exhibit strong
spatial clustering. The lack of significant clustering in these areas
suggests a more random distribution of house values, which may occur in
more mixed-use or transitional neighborhoods where economic
characteristics are varied.
hl.plot <- function(df, listw) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
quadrant <- vector(mode = 'numeric', length = nrow(df))
m.prop <- df$LNMEDHVAL - mean(df$LNMEDHVAL)
m.local <- local[, 1] - mean(local[, 1])
signif <- 0.05
quadrant[m.prop > 0 & m.local > 0] <- 1 # high-high
quadrant[m.prop < 0 & m.local < 0] <- 2 # low-low
quadrant[m.prop < 0 & m.local > 0] <- 4 # low-high
quadrant[m.prop > 0 & m.local < 0] <- 3 # high-low
quadrant[local[, 5] > signif] <- 5 # insignificant
df$quadrant <- factor(quadrant, levels = c(1, 3, 5, 2, 4),
labels = c("High-High", "High-Low", "Non-Significant", "Low-Low", "Low-High"))
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = quadrant), color = "#848884", lwd = 0.07) +
scale_fill_brewer(type = "div", palette = 6, name = "Cluster Type") +
labs(title = "Local Moran's I Cluster Map") +
theme(legend.position="right",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
}
hl.plot(regData, queenlist)

OLS Regression Results
Below, we will provide a quick walk through of the OLS regression
results. Detail interpretation and dicussion of the results can be found
in the GitHub
repository for assignment 1.
All predictors in our OLS regression, which include \(\text{PCTSINGLES}\), \(\text{PCTVACANT}\), \(\text{LNNBELPOV}\), and \(\text{PCTBACHMOR}\), are statistically
significant in explaining \(\text{LNMEDHVAL}\). The model accounts for
approximately 66.2% of the variance, as indicated by the R-squared value
of 0.662.
OLS <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData)
fitted_values <- fitted(OLS)
residuals_values <- residuals(OLS)
standardized_residuals <- rstandard(OLS)
resnb<-sapply(queen, function(x) mean(standardized_residuals[x]))
regData <- regData %>%
mutate(
Fitted = fitted_values,
Residuals = residuals_values,
Standardized_Residuals = standardized_residuals,
Residuals_NB = resnb)
summary(OLS)
##
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2582 -0.2039 0.0382 0.2174 2.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.113778 0.046532 238.84 < 0.0000000000000002 ***
## PCTVACANT -0.019156 0.000978 -19.59 < 0.0000000000000002 ***
## PCTSINGLES 0.002977 0.000703 4.23 0.000024 ***
## PCTBACHMOR 0.020910 0.000543 38.49 < 0.0000000000000002 ***
## LNNBELPOV -0.078903 0.008457 -9.33 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.366 on 1715 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 841 on 4 and 1715 DF, p-value: <0.0000000000000002
The results from the Breusch-Pagan, Koenker-Bassett, and White’s
tests all detect heteroscedasticity in the model, as all p-values are
extremely low. This confirms a clear violation of the homoscedasticity
assumption in OLS regression.
# Breusch-Pagan Test
bptest(OLS, studentize=FALSE)
##
## Breusch-Pagan test
##
## data: OLS
## BP = 113, df = 4, p-value <0.0000000000000002
# Koenker-Bassett Test
bptest(OLS)
##
## studentized Breusch-Pagan test
##
## data: OLS
## BP = 43, df = 4, p-value = 0.00000001
# White Test
white_test(OLS)
## White's test results
##
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 43.94
## P-value: 0
Comparing to the scatter plot we made to identify heteroscedasticity
from the previous assignment, the residuals in this scatter plot show no
discernible trend of rising or falling variance with increasing fitted
values; instead, they seem to be pretty uniformly distributed around the
horizontal line at zero. Regardless, the Breusch-Pagan test is often
more sensitive than a residual plot. Even small deviations that are hard
to see visually may be detected.
ggplot(regData, aes(x = Fitted, y = Standardized_Residuals)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#c44536", size = 1) + #
labs(
title = "Scatter Plot of Standardized Residuals vs Fitted Values",
x = "Fitted Values",
y = "Standardized Residuals"
) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

The results of the Jarque-Bera test indicate a significant deviation
from normality in the residuals. The p-value is well below typical
significance levels, and rejects the null hypothesis that the residuals
are normally distributed. Therefore, the test results suggest that the
residuals do not meet the normality assumption required for OLS
regression.
# Jarque-Bera Test
jarque.bera.test(OLS$residuals)
##
## Jarque Bera Test
##
## data: OLS$residuals
## X-squared = 779, df = 2, p-value <0.0000000000000002
Comparing to the histogram of standardized residuals from the
previous assignment, we may also notice that the residuals are not
perfectly normal. The histogram reveals a slight departure from a normal
distribution and is left skewed. Despite that the skewness is not very
pronounced, it is still consistent with the Jarque-Bera test
results.
ggplot(regData, aes(x = Standardized_Residuals)) +
geom_histogram(bins = 30, fill = "#283d3b", alpha = 0.9) +
labs(title = "Histogram of Standardized Residuals",
x = "Standardized Residuals",
y = "Frequency") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

The scatterplot below shows the relationship between the standardized
residuals of the OLS model and the residuals of their nearest neighbors.
The upward-sloping trend line indicates a positive relationship between
these residuals and their neighbors, suggesting spatial autocorrelation.
The coefficient for the residuals of their nearest neighbors is 0.7323,
with a highly significant p-value, demonstrating a strong positive
relationship. This implies that residuals are positively correlated with
those of their closest neighbors, implying spatial dependence.
ggplot(regData, aes(x = Residuals_NB, y = Standardized_Residuals)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "#c44536", size = 1) +
labs(title = "Residuals vs. Nearest Neighbor Residuals",
x = "Nearest Neighbor Residuals",
y = "Standardized Residuals") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

The Moran’s I scatterplot and the results from the permutations for
OLS regression residuals both indicate significant spatial
autocorrelation. The scatterplot shows a positive relationship between
the logged median house values and their spatially lagged values, with a
positive slope, confirming that high values tend to be near other high
values, and low values cluster with other low values. The Moran’s I
statistic from the Monte Carlo simulation is approximately 0.3, with an
extremely low p-value, confirming that this spatial autocorrelation is
statistically significant.
This significant spatial autocorrelation in the OLS residuals is
problematic because it violates the OLS assumption of independent
residuals. When residuals are spatially autocorrelated, it suggests that
the model is missing spatial structure in the data, which could lead to
biased or inefficient estimates.
OLS_moranMC<-moran.mc(standardized_residuals, queenlist, nsim=999, alternative="two.sided")
OLS_moranMC
##
## Monte-Carlo simulation of Moran I
##
## data: standardized_residuals
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.3, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
Both the Moran’s I statistic and the positive beta coefficient from
the regression of standardized residuals on their nearest neighbors tell
a similar story. They both reveal that spatial dependence is present in
the residuals, suggesting that a spatial model, such as Geographically
Weighted Regression (GWR) or a spatial autoregressive model, would be
more appropriate for capturing the underlying spatial relationships in
the data.
plt1 <- ggplot(data.frame(res = OLS_moranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =OLS_moranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of OLS Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt2 <- ggplot(data = data.frame(
residuals =standardized_residuals,
spatial_lag = lag.listw(queenlist, standardized_residuals)
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for OLS Residuals",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt1 + plt2

Spatial Lag and Error Regression Results
Spatial Lag Regression
In the spatial lag regression output, we may see that the term \(\text{LNMEDHVAL}\), represented by \(\rho\) in the model, has an estimate of
0.651 and an extremely significant p-value (\(p < 2.22 \times 10^{-16}\)).This
indicates that nearby values of the dependent value influence each
other. The significant positive coefficient (\(\rho = 0.651\)) suggests a substantial
positive spatial lag effect, meaning that higher median home values in
neighboring areas are associated with higher median home values in the
area being examined.
All predictors in the spatial lag model are significant. \(\text{PCTVACANT}\) has a coefficient of
-0.0085 and is highly significant since \(p
< 2.22 \times 10^{-16}\). It means a higher vacancy rate is
associated with lower median home values. \(\text{PCTSINGLES}\) has a coefficient of
0.0020, significant with a \(p <
0.0001\). The positive association suggests that areas with a
higher proportion of single households may slightly increase median home
values. \(\text{PCTBACHMOR}\) has a
coefficient of 0.0085 and is extremely significant (\(p < 2.22 \times 10^{-16}\)). Higher
education levels are positively associated with home values, meaning
that more educated areas tend to have higher home values. \(\text{LNNBELPOV}\) has a coefficient of
-0.0341, also highly significant (\(p < 1
\times 10^{-7}\)). The negative association suggests that higher
poverty levels are associated with lower median home values.
The OLS model also shows significant coefficients for all predictors,
but with larger effect sizes. For example, has a coefficient of -0.0192
in OLS, compared to -0.0085 in spatial lag. This suggests that OLS
overestimates the influence of these predictors due to omitted spatial
dependence. For the other predictors, the OLS coefficients are also
larger in magnitude than the spatial lag coefficients.
SL <- lagsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData, queenlist)
summary(SL)
##
## Call:lagsarlm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = regData, listw = queenlist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.655421 -0.117248 0.018654 0.133126 1.726436
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.89845489 0.20111357 19.3843 < 0.00000000000000022
## PCTVACANT -0.00852940 0.00074367 -11.4694 < 0.00000000000000022
## PCTSINGLES 0.00203342 0.00051577 3.9425 0.00008063503
## PCTBACHMOR 0.00851381 0.00052193 16.3120 < 0.00000000000000022
## LNNBELPOV -0.03405466 0.00629287 -5.4116 0.00000006246
##
## Rho: 0.651, LR test value: 912, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.018
## z-value: 36.1, p-value: < 0.000000000000000222
## Wald statistic: 1301, p-value: < 0.000000000000000222
##
## Log likelihood: -256 for lag model
## ML residual variance (sigma squared): 0.0719, (sigma: 0.268)
## Number of observations: 1720
## Number of parameters estimated: 7
## AIC: 525, (AIC for lm: 1435)
## LM test for residual autocorrelation
## test value: 67.7, p-value: 0.00000000000000022204
We ran Breusch-Pagan test to assess whether the residuals of the
spatial lag regression model exhibit heteroscedasticity. As we may see,
the test statistic for the spatial lag regression is \(BP = 211\) with \(df = 4\), and \(p
< 2 \times 10^{-16}\). Given the extremely low p-value, we
reject the null hypothesis of homoscedasticity, indicating that
residuals in the spatial lag regression model remain
heteroscedastic.
# Breusch-Pagan Test
bptest.Sarlm(SL, studentize=FALSE)
##
## Breusch-Pagan test
##
## data:
## BP = 211, df = 4, p-value <0.0000000000000002
When comparing the OLS and spatial lag regression models, several
metrics show the spatial lag model provides a better fit. The spatial
lag model has a significantly lower AIC (525) than OLS (1435),
indicating a better model fit by penalizing less for added complexity.
Similarly, the Schwarz Criterion is much lower for the spatial lag
model, suggesting it captures the structure of the data more effectively
than the OLS model. The spatial lag model (-256) has a higher log
likelihood than OLS (-711), meaning that it better explains the
variability in the data.
# Akaike Information Criterion
aic_ols <- AIC(OLS)
aic_sl <- AIC(SL)
# Schwarz Criterion
bic_ols <- BIC(OLS)
bic_sl <- BIC(SL)
# The Log Likelihood
loglik_ols <- logLik(OLS)
loglik_sl <- logLik(SL)
results <- data.frame(
Model = c("OLS Regression", "Spatial Lag Regression"),
AIC = c(aic_ols, aic_sl),
SchwarzCriterion = c(bic_ols, bic_sl),
LogLikelihood = c(loglik_ols, loglik_sl)
)
results %>%
kable(row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
|
Model
|
AIC
|
SchwarzCriterion
|
LogLikelihood
|
|
OLS Regression
|
1435
|
1468
|
-711
|
|
Spatial Lag Regression
|
525
|
564
|
-256
|
The Likelihood Ratio Test assesses the improvement in model fit
between the OLS and spatial lag models, and the result we get here shows
\(LR = 912\) with \(df = 1\), and \(p
< 2 \times 10^{-16}\). With a large test statistic and low
p-value, we conclude that the spatial lag model significantly improves
fit over the OLS model.
# The Likelihood Ratio Test
lr_test <- LR.Sarlm(SL, OLS)
lr_test
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 912, df = 1, p-value <0.0000000000000002
## sample estimates:
## Log likelihood of SL Log likelihood of OLS
## -256 -711
Finally, we examined Moran’s I for the spatial lag regression model’s
residuals.The observed Moran’s I of -0.08 suggests a small negative
spatial autocorrelation in the residuals of the spatial lag model. With
\(p = 0.002\), we reject the null
hypothesis of no spatial autocorrelation. This negative Moran’s I
indicates that residuals are distributed with slight spatial dispersion
rather than clustering, suggesting the spatial lag model effectively
accounts for most spatial dependencies in the data.
SL_moranMc<-moran.mc(SL$residuals, queenlist,999, alternative="two.sided")
SL_moranMc
##
## Monte-Carlo simulation of Moran I
##
## data: SL$residuals
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = -0.08, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided
Both the residual histograms as well as the scatter plot support this
conclusion. The histogram of residuals is also approximately normally
distributed, and the scatter plot of residuals against fitted values
shows no clear pattern or trend. All of these suggest that the spatial
lag model is a good fit for the data compared to OLS.
plt3 <- ggplot(data.frame(res = SL$residuals), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =SL_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of SL Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt4 <- ggplot(data = data.frame(
residuals =SL$residuals,
spatial_lag = lag.listw(queenlist, SL$residuals)
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for SL Residuals",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt3 + plt4

Spatial Error Regression
In the spatial error regression model, the term \(\lambda\) represents the spatial
autocorrelation within the model’s error terms. In this case, \(\lambda\) is estimated at 0.815 and is
statistically significant, with \(p < 2.22
\times 10^{-16}\). This highly significant and large value of
\(\lambda\) suggests a strong spatial
dependency in the error terms.
Examining the individual predictors in the model, all variables are
statistically significant at a high confidence level. \(\text{PCTVACANT}\) has an estimate of
-0.0058 (\(p < 6.94 \times
10^{-9}\)), indicating that vacancy rates negatively affect
median home values. \(\text{PCTSINGLES}\) has a positive
coefficient of 0.0027 (\(p < 1.61 \times
10^{-5}\)), suggesting a small positive association with median
home values. \(\text{PCTBACHMOR}\) ,
representing the percentage of bachelor’s degrees or higher, shows a
positive effect with an estimate of 0.0098 (\(p < 2.20 \times 10^{-16}\)), indicating
that higher education levels correlate with higher home values. \(\text{LNNBELPOV}\) has a negative estimate
of -0.0345 and \(p < 1.11 \times
10^{-6}\), implying that higher levels of neighborhood poverty
are associated with lower median home values.
Similar to spatial lag regression, when comparing these results with
the OLS regression model, we observe that the coefficients for our
predictors are all smaller in magnitude than their OLS counterparts. For
instance, had a coefficient of -0.019156 in the OLS model, while in the
spatial error model, it is reduced to -0.0058. This reduction in
coefficient magnitude suggests that the spatial error model provides a
more conservative estimate of the effects of these predictors on median
home values, likely due to the accounted spatial autocorrelation.
SE <- errorsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData, queenlist)
summary(SE)
##
## Call:errorsarlm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = regData, listw = queenlist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.926477 -0.115408 0.014889 0.133852 1.948664
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.90643427 0.05346777 203.9815 < 0.00000000000000022
## PCTVACANT -0.00578308 0.00088670 -6.5220 0.00000000006937
## PCTSINGLES 0.00267792 0.00062083 4.3134 0.00001607389089
## PCTBACHMOR 0.00981293 0.00072896 13.4615 < 0.00000000000000022
## LNNBELPOV -0.03453409 0.00708933 -4.8713 0.00000110881162
##
## Lambda: 0.815, LR test value: 678, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.0164
## z-value: 49.8, p-value: < 0.000000000000000222
## Wald statistic: 2477, p-value: < 0.000000000000000222
##
## Log likelihood: -373 for error model
## ML residual variance (sigma squared): 0.0766, (sigma: 0.277)
## Number of observations: 1720
## Number of parameters estimated: 7
## AIC: 759, (AIC for lm: 1435)
Based on the results of the Breusch-Pagan test, the spatial lag
regression residuals are indeed heteroscedastic. The test statistic is
\(BP = 23\) with \(df = 4\), and \(p
< 0.0001\). Since the p-value is significantly lower than the
conventional alpha levels (0.01, 0.05), we reject the null hypothesis of
homoscedasticity. This indicates that the variance of the residuals is
not constant across observations, suggesting the presence of
heteroscedasticity in the model’s residuals.
# Breusch-Pagan Test
bptest.Sarlm(SE, studentize=FALSE)
##
## Breusch-Pagan test
##
## data:
## BP = 23, df = 4, p-value = 0.0001
We also compared the spatial error model to the OLS model using the
AIC, SC, log-likelihood, and likelihood ratio test. The spatial error
model has a lower AIC (759) and SC (798) than the OLS model (1435 and
1468, respectively), indicating a better fit. The log-likelihood of the
spatial error model (-373) is also higher than that of the OLS model
(-711), suggesting that the spatial error model better explains the
variability in the data.
# Akaike Information Criterion
aic_se <- AIC(SE)
# Schwarz Criterion
bic_se <- BIC(SE)
# The Log Likelihood
loglik_se <- logLik(SE)
results <- data.frame(
Model = c("OLS Regression", "Spatial Lag Regression", "Spatial Error Regression"),
AIC = c(aic_ols, aic_sl, aic_se),
SchwarzCriterion = c(bic_ols, bic_sl, bic_se),
LogLikelihood = c(loglik_ols, loglik_sl, loglik_se)
)
results %>%
kable(row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
|
Model
|
AIC
|
SchwarzCriterion
|
LogLikelihood
|
|
OLS Regression
|
1435
|
1468
|
-711
|
|
Spatial Lag Regression
|
525
|
564
|
-256
|
|
Spatial Error Regression
|
759
|
798
|
-373
|
According to the Likelihood Ratio Test, we see \(LR = 678\) with \(df = 1\), and \(p
< 2 \times 10^{-16}\). With a large test statistic and low
p-value, we may also conclude that the spatial error model significantly
improves fit over the OLS model.
lr_test <- LR.Sarlm(SE, OLS)
lr_test
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 678, df = 1, p-value <0.0000000000000002
## sample estimates:
## Log likelihood of SE Log likelihood of OLS
## -373 -711
Finally, we examined Moran’s I for the spatial error regression
model’s residuals. The observed Moran’s I of -0.09 suggests a small
negative spatial autocorrelation in the residuals of the spatial error
model. With \(p = 0.002\), we reject
the null hypothesis of no spatial autocorrelation. This negative Moran’s
I indicates that residuals are distributed with slight spatial
dispersion rather than clustering, suggesting the spatial error model
effectively accounts for most spatial dependencies in the data.
SE_moranMc<-moran.mc(residuals(SE), queenlist,999, alternative="two.sided")
SE_moranMc
##
## Monte-Carlo simulation of Moran I
##
## data: residuals(SE)
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = -0.09, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided
Both the residual histograms as well as the scatter plot support this
conclusion here again. The histogram of residuals is approximately
normally distributed, and the scatter plot of residuals against fitted
values shows a much less evident trend compared to the OLS model.
Therefore, all of these suggest that the spatial error model is a good
fit for the data compared to OLS.
plt5 <- ggplot(data.frame(res = residuals(SE)), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =SE_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of SE Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt6 <- ggplot(data = data.frame(
residuals = residuals(SE),
spatial_lag = lag.listw(queenlist, residuals(SE))
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for SE Residuals",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt5 + plt6

It should be noted that here, we also compared spatial lag model with
the spatial error model. The spatial lag model has a lower AIC (525) and
SC (564) than the spatial error model (759 and 798, respectively),
indicating a better fit. The log-likelihood of the spatial lag model
(-256) is also less negative than that of the spatial lag error (-373),
suggesting that the spatial lag model better explains the variability in
the data.
Geographically Weighted Regression Results
The \(R^2\) for the OLS regression
model is 0.662, while the \(R^2\)
(Quasi-global \(R^2\) ) for the GWR
model is 0.848. The higher \(R^2\)
value of the GWR model indicates that it does a better job of explaining
the variance in the dependent variable \(\text{LNMEDHVAL}\) compared to the OLS
model. This improvement suggests that the spatial variability captured
by the GWR model provides a more accurate representation of the
relationships between the predictors and the dependent variable, as GWR
can account for spatial heterogeneity that OLS cannot.
Based on the Akaike Information Criterion (AIC) values, the GWR
model, with an AIC of 309, demonstrates the best fit among the models
compared. The Spatial Lag model has a higher AIC of 525, followed by the
Spatial Error model at 759, and the OLS regression model at 1435. Since
a lower AIC indicates a better fit, the GWR model appears to be the most
effective in explaining the variance in the data. This suggests that
GWR, which accounts for spatial heterogeneity by allowing model
coefficients to vary across locations, provides a more accurate and
robust fit for this dataset than the other models.
gwrmodel
## Call:
## gwr(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV, data = regDatashps, gweight = gwr.Gauss, adapt = bw,
## hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.00813 (about 13 of 1720 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 9.67276 10.71432 10.95424 11.17420 12.08314 11.11
## PCTVACANT -0.03174 -0.01424 -0.00896 -0.00358 0.01679 -0.02
## PCTSINGLES -0.02497 -0.00755 -0.00166 0.00423 0.01433 0.00
## PCTBACHMOR 0.00110 0.01014 0.01493 0.02022 0.03473 0.02
## LNNBELPOV -0.23652 -0.07336 -0.04012 -0.01267 0.09488 -0.08
## Number of data points: 1720
## Effective number of parameters (residual: 2traceS - traceS'S): 361
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1359
## Sigma (residual: 2traceS - traceS'S): 0.276
## Effective number of parameters (model: traceS): 258
## Effective degrees of freedom (model: traceS): 1462
## Sigma (model: traceS): 0.266
## Sigma (ML): 0.246
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 661
## AIC (GWR p. 96, eq. 4.22): 309
## Residual sum of squares: 104
## Quasi-global R2: 0.848
The choropleth map of local \(R^2\)
values from the GWR model illustrates how the model explains variations
in logged median house values across the city. Darker-colored areas,
such as the northwestern, parts of the northeastern, and far western
regions, indicate a stronger model fit, where the predictors in the
model explain a larger proportion of the variability in house values. In
contrast, lighter-colored areas, particularly in the northern and
western parts of the city, show a weaker model fit, suggesting that the
predictors are less effective in capturing the variability in these
regions.
gwrresults<-as.data.frame(gwrmodel$SDF)
regDatashps$localR2<-gwrresults$localR2
regDatashps_sf <- st_as_sf(regDatashps)
ggplot(data = regDatashps_sf) +
geom_sf(aes(fill = localR2), color = NA) +
scale_fill_gradientn(colors = c("#FAF9F6", "#c44536"),
name = "Local R-Squared",
na.value = "transparent")+
labs(title = "Local R-Squared from GWR", x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

The analysis indicates that the GWR model has successfully reduced
spatial autocorrelation in the residuals compared to the OLS model. In
the distribution of Moran’s I values for GWR residuals, the observed
Moran’s I is close to zero, suggesting minimal spatial autocorrelation
and aligning with the assumption of uncorrelated residuals. In contrast,
the Moran’s I for the OLS residuals is both positive and statistically
significant, indicating a stronger degree of spatial dependence.
The spatial lag model also reduces spatial autocorrelation in the
residuals, although not as effectively as the GWR model. The spatial
error model performs similarly, but the residuals exhibit slightly more
spatial autocorrelation than those of the GWR model. Overall, these
results suggest that the GWR model provides the best fit in terms of
minimizing spatial autocorrelation, capturing local spatial variations
more precisely than the spatial lag and spatial error models. This
indicates that GWR is particularly effective for modeling spatial
heterogeneity in this dataset.
GWR_moranMc<-moran.mc(gwrresults$gwr.e, queenlist, 999, alternative="two.sided")
GWR_moranMc
##
## Monte-Carlo simulation of Moran I
##
## data: gwrresults$gwr.e
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.03, observed rank = 992, p-value = 0.02
## alternative hypothesis: two.sided
The Moran’s I scatter plot for the GWR residuals shows minimal
spatial autocorrelation, as indicated by the nearly horizontal trend of
the line, suggesting that the GWR model has successfully addressed much
of the spatial dependency present in the data. This is consistent with
the earlier histogram results, where the Moran’s I value for GWR
residuals was close to zero, further affirming that the model
effectively captures local spatial variations in logged median house
values.
plt7 <- ggplot(data.frame(res = gwrresults$gwr.e), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept =GWR_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Moran's I of GWR Residuals",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt8 <- ggplot(data = data.frame(
residuals = gwrresults$gwr.e,
spatial_lag = lag.listw(queenlist, gwrresults$gwr.e)
), aes(x = residuals, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Moran's I Scatter Plot for GWR Residuals",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
plt7 + plt8

Looking at each predictors in detail, the vacancy rate generally has
a negative impact on housing values, with significant negative effects
observed in Center City, northwestern areas, parts of the northeastern
region, and the western part of the city. However, there is a localized
area in southern Center City where vacancy rates unexpectedly have a
positive impact on housing values.
Single-family housing, in general, positively impacts housing values,
particularly in the far northeast and far northwest areas, where this
effect is more pronounced. Conversely, in some parts of the city—such as
eastern Philadelphia, parts of the west, and sections of the south—the
impact of single-family housing is negative and statistically
significant.
The percentage of residents with a bachelor’s degree generally has a
positive impact on housing values, with no areas showing a negative
relationship across the city. This positive influence is especially
statistically significant in the northern and northeastern parts of the
city.
Finally, the poverty rate generally has a negative effect on housing
values. This effect is more statistically significant in the southern,
eastern, and northwestern parts of the city.
Overall, these spatial variations highlight how different factors
affect housing values uniquely across Philadelphia, emphasizing the
importance of localized analysis in understanding property dynamics.
regDatashps_sf$coefPCTVACANTst_cat <- cut(regDatashps_sf$coefPCTVACANTst,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
regDatashps_sf$coefPCTSINGLESst_cat <- cut(regDatashps_sf$coefPCTSINGLESst,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
regDatashps_sf$coefPCTBACHMORst_cat <- cut(regDatashps_sf$coefPCTBACHMORst,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
regDatashps_sf$coefLNNBELPOVst_cat <- cut(regDatashps_sf$coefLNNBELPOVst,
breaks = c(-Inf, -2, 0, 2, Inf),
labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
p1 <- ggplot(regDatashps_sf) +
geom_sf(aes(fill = coefPCTVACANTst_cat), color = NA) +
scale_fill_manual(
values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"),
name = "Ratio Category"
) +
labs(title = "PCTVACANT Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p2 <- ggplot(regDatashps_sf) +
geom_sf(aes(fill = coefPCTSINGLESst_cat), color = NA) +
scale_fill_manual(
values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"),
name = "Ratio Category"
) +
labs(title = "PCTSINGLES Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p3 <- ggplot(regDatashps_sf) +
geom_sf(aes(fill = coefPCTBACHMORst_cat), color = NA) +
scale_fill_manual(
values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"),
name = "Ratio Category"
) +
labs(title = "PCTBACHMOR Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p4 <- ggplot(regDatashps_sf) +
geom_sf(aes(fill = coefLNNBELPOVst_cat), color = NA) +
scale_fill_manual(
values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"),
name = "Ratio Category"
) +
labs(title = "LNNBELPOV Coefficient to Standard Error",
x = "", y = "") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
p1 + p2 + p3 + p4 +
plot_layout(ncol = 2)

---
title: "Using Geographically Weighted Regression, Spatial Lag, and Spatial Error to Predict Median House Values in Philadelphia"
author: "Emily Zhou, Ziyi Guo, Emma Jiang"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes

editor_options:
  markdown:
    wrap: sentence
---

Version 2.0 | First Created Oct 22, 2024 | Updated Nov 2, 2024

Keywords: Spatial Error, Spatial Lag, Geographically Weighted Regression, Global & Local Moran's I, Akaiki Information Criterion, Schwarz Criterion, Log Likelihood, Likelihood Ratio Test

GitHub Repository: [CPLN671-GW-Regression](https://github.com/emilyzhou112/CPLN671-GW-Regression)

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r load packages, message=FALSE, warning=FALSE, include=FALSE}

options(scipen=999)
options(digits = 3)

# List of required packages
packages <- c("tidyverse", "sf", "here", "spdep", "spgwr", "spatialreg", 
              "whitestrap", "lmtest", "tseries", "ggplot2", "kableExtra", "patchwork")

# Install and load required packages
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE, quietly=TRUE)
      library(x, character.only = TRUE)
    }
  }
)

```


```{r load data, message=FALSE, warning=FALSE, include=FALSE}

regData <- st_read(here("data", "RegressionData.shp"))

```

# Introduction

Philadelphia's housing market has undergone significant transformations in recent years, with rising property values creating urgent concerns for urban planners, policymakers, and residents alike. This upward trend has intensified issues of housing affordability, disproportionately impacting low- and middle-income households. The rapid increase in housing values has spurred gentrification, displacing long-term residents and reshaping the socio-economic fabric of various neighborhoods across the city. As researchers have observed, gentrification often leads to social displacement, exacerbating inequality as housing affordability declines (Freeman & Braconi, 2004; Atkinson, 2004). Understanding the factors driving these shifts in property values is essential for addressing affordability challenges, promoting neighborhood stability, and fostering equitable development in Philadelphia (Lees, 2008).

In a previous study, Ordinary Least Squares (OLS) regression was employed to explore the relationships between median house values, the dependent variable, and several key socio-economic predictors. These predictors included educational attainment, vacancy rates, the proportion of detached single-family homes, and poverty levels. Each variable was chosen for its established influence on housing markets and its ability to shed light on underlying socio-economic conditions that may shape property values. For instance, educational attainment is positively associated with economic prosperity and housing demand, as areas with higher educational levels often benefit from higher incomes and greater investment Vacancy rates, on the other hand, are typically linked to neighborhood decline and reduced property values, as vacant properties signal economic distress, discourage investment, and may contribute to higher crime rates (Mallach, 2018). The housing market preference for single-family homes, which offer greater space and privacy, is well-documented in the literature (Glaeser & Gyourko, 2018), while research consistently demonstrates a negative correlation between poverty levels and housing values, with higher poverty rates often linked to decreased demand and underinvestment in local infrastructure (Galster, 2008).

Although OLS regression provides a foundational understanding of these relationships, it has limitations when applied to spatial data, as it assumes independence between observations and ignores potential spatial dependencies. Housing values in one area are often influenced by those in nearby areas, resulting in spatial autocorrelation that can lead to biased or misleading results when using traditional OLS methods. Spatial autocorrelation reflects Tobler’s First Law of Geography, which states that “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). When spatial dependencies are ignored, as is the case in OLS regression, the estimates may suffer from omitted variable bias, yielding inaccurate conclusions about the relationships between variables (Anselin, 1988).

To address these limitations, this report applies spatial econometric techniques, specifically spatial lag, spatial error, and geographically weighted regression (GWR) models, to more accurately capture the spatial dependencies affecting housing values in Philadelphia. The spatial lag model incorporates the influence of neighboring values directly into the regression, allowing for an understanding of how housing values in one area may affect those in adjacent areas (LeSage & Pace, 2009). The spatial error model accounts for spatial autocorrelation in the residuals, isolating unobserved spatially correlated factors that may influence housing values (Anselin, 1988). Lastly, the geographically weighted regression (GWR) model offers a localized perspective, allowing coefficients to vary by location and capturing the heterogeneity of relationships across different neighborhoods (Brunsdon, Fotheringham, & Charlton, 1996). By employing these spatial techniques, this study aims to enhance the accuracy of the initial OLS findings and provide a more comprehensive understanding of the socio-economic and spatial factors influencing housing values. These insights will support more effective policy interventions and urban development strategies aimed at achieving equitable and sustainable growth in Philadelphia.

# Methods 

## Spatial Autocorrelation

The First Law of Geography, proposed by Waldo Tobler, states that *"everything is related to everything else, but nearer things are more related than distant things."* This law captures the principle of spatial dependence, which underpins many spatial analyses, including spatial regression models. One way to measure spatial dependence is through Moran's I, a statistic that quantifies spatial autocorrelation. Moran's I for a variable is calculated as follows: 


$$
I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
$$
where $n$ is the number of observations, $X_i$ and $X_j$ are the values of the variable at locations $i$ and $j$, respectively, $\bar{X}$ is the mean of the variable, and $w_{ij}$ is the spatial weight between locations $i$ and $j$.

Here, we use a spatial weight matrix constructed using a "Queen" contiguity method, which defines each unit's neighbors based on shared boundaries or vertices. This matrix remains consistent across the analysis, although statisticians may use different weight matrices to assess model sensitivity to neighbor definitions.

Testing the significance of spatial autocorrelation involves evaluating whether the observed Moran's I differs from what would be expected under spatial randomness. In hypothesis testing, the null hypothesis (\( H_0 \)) is that there is no spatial autocorrelation, while the alternative (\( H_a \))  posits the presence of spatial autocorrelation. Using random permutations of data values across locations, we generate a reference distribution of Moran's I under the null hypothesis and compare the observed statistic to this distribution.

Beyond global spatial autocorrelation measured by Moran's I, local spatial autocorrelation identifies specific areas with clustering or dispersion. The significance of local Moran's I is tested similarly using random permutations, and results can highlight statistically significant clusters or outliers that global measures might miss.

## Ordinary Least Squares Regression

Ordinary Least Squares (OLS) regression is a statistical technique for estimating the relationship between a dependent variable and one or more independent variables by minimizing the squared differences between observed and predicted values. Key assumptions of OLS include linearity, independence of observations, homoscedasticity (constant error variance), normality of errors, and no multicollinearity among predictors.

In our first assignment, we used OLS regression to assess how vacancy rates, single-family housing percentage, educational attainment, and poverty levels influence median house value. All predictors were statistically significant, with vacancy rates and poverty levels negatively affecting house values, while single-family housing and educational attainment had positive effects. The model’s \(R^2\) was 0.66, explaining 66% of the variance in house values. However, some predictors exhibited non-linear patterns, and spatial autocorrelation suggested dependence among observations, indicating that future models could benefit from spatial regression techniques.

When data has a spatial component, the assumption that errors are random and independent often doesn’t hold, as nearby observations may exhibit similar error patterns. To test this, we can examine the spatial autocorrelation of the residuals using Moran’s I, which quantifies the degree of clustering in residuals. Another approach is to regress the residuals on nearby residuals from neighboring areas, such as block groups defined by the Queen matrix, to identify any spatial dependence. These tests help determine if spatial autocorrelation is present, indicating a need for spatial regression techniques.

In R, there are methods to test other key regression assumptions. For homoscedasticity (constant error variance), which is related to the independence of errors, we use the **Breusch-Pagan Test**, the **Koenker-Bassett Test**, and the **White Test**. The null hypothesis (\( H_0 \)) is that errors are homoscedastic, while the alternative hypothesis (\( H_a \)) is that errors exhibit heteroscedasticity. For normality of errors, we can use the **Jarque-Bera test**. The null hypothesis is that residuals are normal, and the alternative hypothesis is that they are not normal. We want to not be able to reject the null hypothesis (i.e., get a p-value of 0.05 or higher).


## Spatial Lag and Spatial Error Regression

In this report, we use R to run spatial lag and spatial error regressions. Among the two methods, spatial lag regression assumes the value of the dependent variable at one location is associated with the values of that variable in nearby locations, where nearby is as defined by the weights matrix W (rook, queen, within a certain distance of one another). In short, it introduces a spatially lagged dependent variable into the model, compared to our previous OLS regression. In our context, the spatial lag model could be represented as the following: 

$$
\text{LNMEDHVAL} = \rho W \cdot \text{LNMEDHVAL} + \beta_0 + \beta_1 \cdot \text{PCTVACANT} + \beta_2 \cdot \text{PCTSINGLES} + \beta_3 \cdot \text{PCTBACHMOR} + \beta_4 \cdot \text{LNNBELPOV100} + \epsilon_i
$$

where \(\text{LNMEDHVAL}\) is the log of median home value in location, \( \rho \) is the spatial lag coefficient that measures the influence of neighboring areas, \( W \) is the spatial weights matrix (in this case, the queenlist spatial weights), and \( W \cdot \text{LNMEDHVAL} \) is the spatially lagged dependent variable. The other terms are the same as in the OLS regression, where \(\text{PCTVACANT}\), \(\text{PCTSINGLES}\), \(\text{PCTBACHMOR}\), and \(\text{LNNBELPOV}\) are the predictors, \( \beta_1, \beta_2, \beta_3, \beta_4 \) are the coefficients, \( \beta_0 \) is the intercept term, and \( \epsilon_i \) is the error term.

The spatial error model, on the other hand, assumes that the error term is spatially autocorrelated, meaning that the residual in one location is associated with residuals at nearby locations, where nearby is also deifined by the weight matrix. The spatial error model could be represented as the following:

$$
\text{LNMEDHVAL} = \beta_0 + \beta_1 \cdot \text{PCTVACANT} + \beta_2 \cdot \text{PCTSINGLES} + \beta_3 \cdot \text{PCTBACHMOR} + \beta_4 \cdot \text{LNNBELPOV100} + \lambda W \cdot \epsilon + u
$$

where \(\text{LNMEDHVAL}\) is the log of median home value, \( \beta_0 \) is the intercept term, \(\text{PCTVACANT}\), \(\text{PCTSINGLES}\), \(\text{PCTBACHMOR}\), and \(\text{LNNBELPOV}\) are the predictors, \( \beta_1, \beta_2, \beta_3, \beta_4 \) are the coefficients, as in OLS regression. \( \lambda \) is the spatial error coefficient that measures the degree of spatial correlation in the error term, \( W \) is the spatial weights matrix, \( W \cdot \epsilon \) is the spatially lagged error term, and \( u \) is the random noise. To put it simply, we are regressing residuals on the nearest neighbor residuals, thereby filtering the spatial information out of the OLS residuals

Spatial error regression and spatial lag regression both require the standard assumptions for OLS regression—such as linearity, homoscedasticity, and normality of errors—except for the assumption of spatial independence among observations. This adjustment allows the model to account for spatial structure in either the dependent variable (spatial lag) or the error term (spatial error). That said, their goal is to account for spatial dependence in the data, aiming to reduce spatial autocorrelation in the regression residuals. Both methods minimize spatial patterns in residuals that could otherwise lead to biased or inefficient estimates.

We compare the results of spatial lag and spatial error regression with OLS to decide whether the spatial models perform better than OLS based on a number of criteria: Akaike Information Criterion, Schwarz Criterion, Log Likelihood, and Likelihood Ratio Test. The **Akaike Information Criterion (AIC)** and **Schwarz Criterion (SC or BIC)** are used to compare the goodness of fit of different models. They are relative measures of the information that is lost when a given model is used to describe reality and can be said to describe the tradeoff between precision and complexity of the model. The lower the AIC or SC, the better the model.

The **Log-Likelihood** is associated with the maximum likelihood method of fitting a statistical model to the data and estimating model parameters. Maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.  Higher log-likelihood values indicate a model that better explains the observed data. 

The **Likelihood Ratio Test** is used to formally test whether adding spatial dependence to a model (as in spatial lag or spatial error models) significantly improves model fit compared to OLS. For this test, the null hypothesis (\( H_0 \)) state that the spatial model does not provide a significantly better fit than OLS, while the alternative hypothesis (\( H_a \)) states that the spatial model provides a significantly better fit than OLS. 

The decision rule is to reject the null hypothesis if the \( LR \) test statistic is significant (i.e., the p-value is below a chosen significance level, typically 0.05), and conclude that the spatial model is a better fit than OLS. If not, OLS may be adequate. *It should be noted that the log likelihood method and the likelihood ratio test should be used for comparing nested models. Spatial lag and spatial error are not a special case of each other – we cannot use the log likelihood ratio to compare them.*

Alternatively, we can also compare the spatial models to OLS using the **Moran's I** statistic, which measures the spatial autocorrelation of the residuals. Moran's I ranges from -1 to 1, where -1 indicates perfect dispersion, 0 indicates no spatial autocorrelation, and 1 indicates perfect correlation. For our models, the goal is to minimize spatial autocorrelation in the residuals.  If the Moran’s I statistic for the residuals of a spatial lag or spatial error model is closer to zero than the Moran’s I for the OLS model, we can conclude that the spatial model better captures spatial dependencies. 


## Geographically Weighted Regression

We will conduct our Geographically Weighted Regression (GWR) analyses in R. GWR is a form of local regression that helps address spatial heterogeneity in data, which is essential when analyzing spatial data prone to Simpson’s Paradox — where a trend observed in aggregate data can differ from trends in subsets. GWR allows us to examine relationships at a local level rather than assuming they are uniform across the study area. The general GWR equation can be expressed as:


$$
y_i = \beta_{i0} + \sum_{k=1}^{m} \beta_{ik}x_{ik} + \epsilon_i
$$

where: \( y_i \) is the dependent variable at location \(i\), \( \beta_{i0} \) is the intercept for location \(i\), allowing a unique baseline for each location, \( \beta_{ik} \) is the coefficient for the \(k\)-th predictor at location \(i\), \( x_{ik} \) is the value of the \(k\)-th predictor variable at location \(i\), and \( \epsilon_i \) is the error term at location \(i\).

In GWR, local regression is performed by fitting a regression model at each observation point, using a subset of neighboring points, with weights assigned based on their distance from the focal point. The concept of bandwidth controls the number of neighbors included, influencing how "local" each regression is. There are two types of bandwidths: adaptive and fixed. A fixed bandwidth uses a constant distance for all points, while an adaptive bandwidth varies, adjusting to include a set number of nearby observations regardless of spatial density. Here, we will use adaptive bandwidth, which is more appropriate as it accounts for varying spatial densities, providing a flexible analysis that better captures local relationships in areas with different population distributions.

Although GWR allows for spatial variation in relationships, the standard OLS assumptions (linearity, independence, homoscedasticity, and normality) still apply. Multicollinearity is assessed using the Condition Number, and high multicollinearity can cause issues in GWR, leading to unstable estimates and clustering in parameter estimates. It is also important to note that GWR does not provide p-values for coefficients, as the model focuses on exploring spatial patterns rather than testing global hypotheses.


# Results

## Global and Local Moran's I

The Global Moran's I analysis for the dependent variable, \(\text{LNMEDHVAL}\) (the natural log of median house value), reveals a pronounced level of spatial autocorrelation. With a Moran's I statistic of 0.8, the results indicate a strong positive spatial autocorrelation. This high value suggests that areas with similar median house values tend to cluster geographically within the study area. In other words, neighborhoods with either high or low house values are more likely to be located near other neighborhoods with similar values, rather than being randomly distributed across space. This spatial clustering points to the presence of spatial dependencies in housing values, possibly driven by neighborhood characteristics, socio-economic factors, or other spatial processes influencing property values across the region.

To validate the significance of this observed spatial autocorrelation, a Monte Carlo permutation test was conducted using 1000 simulations. This approach involved randomly permuting the values of \(\text{LNMEDHVAL}\) across spatial units to generate a distribution of Moran's I values under the null hypothesis of no spatial autocorrelation. The results, visualized in a histogram of permuted Moran’s I values, show that the observed Moran’s I of 0.8 lies at the extreme end of this distribution, marked in red. With an observed rank of 1000 (the highest rank in the distribution), the observed Moran’s I value exceeded all permuted values, emphasizing the extremity of the spatial clustering in the actual data.

The test result is further supported by a highly significant p-value (\( p < 2 \times 10^{-16} \)). This exceptionally low p-value strongly rejects the null hypothesis of no spatial autocorrelation, confirming that the spatial arrangement of \(\text{LNMEDHVAL}\) values is not random. Instead, the observed clustering is statistically significant, indicating that spatial processes are likely influencing the distribution of housing values in the study area.

```{r construct queen neighbors, message=FALSE, warning=FALSE, include=FALSE}

queen<-poly2nb(regData, row.names=regData$POLY_ID)
queenlist<-nb2listw(queen, style = 'W')

```


```{r global moran I, message=FALSE, warning=FALSE}

globalmoranMC<-moran.mc(regData$LNMEDHVAL, queenlist, nsim=999, alternative="two.sided") 
globalmoranMC

```

This graph shows the results of the Monte Carlo permutation test to assess the significance of the observed Moran’s I statistic.The histogram shows the frequency of Moran's I values generated by these random permutations, with the observed Moran’s I value highlighted in red. The observed statistic of 0.8 stands far to the right of the permuted values, underscoring its extremity and significance. The high observed rank (1000) indicates that none of the permuted values exceeded the actual Moran’s I. This extremely low p-value provides strong evidence against the null hypothesis of no spatial autocorrelation, confirming that the spatial clustering observed in \(\text{LNMEDHVAL}\) is statistically significant and unlikely to be due to random variation.

```{r global moran histogram plot, message=FALSE, warning=FALSE}

ggplot(data.frame(res = globalmoranMC$res), aes(x = res)) +
  geom_histogram(bins = 100, fill = "#283d3b") +
  geom_vline(xintercept = globalmoranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
  labs(title = "Observed and Permuted Global Moran's I",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

```

This graph offers a different perspective on the spatial structure of \(\text{LNMEDHVAL}\) by displaying a scatter plot of the variable’s values against their spatial lag (a measure of neighboring values). The x-axis represents the logged median house values \(\text{LNMEDHVAL}\), while the y-axis displays the spatially lagged values of  \(\text{LNMEDHVAL}\), computed based on a queen contiguity spatial weights matrix that considers neighboring spatial units. The positive slope of the red trend line in the scatter plot indicates a positive spatial autocorrelation, where areas with higher median house values are typically surrounded by other areas with high values, and similarly, areas with lower values are near other low-value areas. This linear relationship between a location’s \(\text{LNMEDHVAL}\) and the average values in surrounding locations highlights the clustering of similar values and supports the result from the Global Moran’s I statistic.

```{r global moran scatter plot, message=FALSE, warning=FALSE}

ggplot(data = data.frame(
  LNMEDHVAL = regData$LNMEDHVAL,
  spatial_lag = lag.listw(queenlist, regData$LNMEDHVAL)
), aes(x = LNMEDHVAL, y = spatial_lag)) +
  geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +  
  geom_smooth(method = "lm", color = "#c44536", se = FALSE) + 
  labs(title = "Global Moran's I Scatter Plot",
       x = "Logged Median House Value",
       y = "Spatial Lag of LNMEDHVAL") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

```


```{r compute local moran I, message=FALSE, warning=FALSE, include=FALSE}

localmoran <-localmoran(regData$LNMEDHVAL , queenlist)
localmoran <-cbind(regData, as.data.frame(localmoran))

```

In a Local Moran's I significance map, areas with significant low p-values refers to area where the spatial autocorrelation is statistically significant, either hotspots of concentrated high values or cold spots of low values. There areas are mostly found in the northwestern and central parts of the city. As we may see, these areas are then surrounded by areas with slightly higher p-values and then areas with even higher p-values, until p becomes insignificant. 

```{r local moran significance map, message=FALSE, warning=FALSE}

moranSig.plot <- function(df, listw, title) {
  
  local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
  
  df$Pr.z <- local[,  "Pr(z != E(Ii))"]  
  
  df$pval_category <- cut(df$Pr.z, 
                          breaks = c(0, 0.001, 0.01, 0.05, 1), 
                          labels = c("0.000 - 0.001", "0.001 - 0.010", "0.010 - 0.050", "0.050 - 1.000"), 
                          include.lowest = TRUE)
  
  if (!inherits(df, "sf")) {
    df <- st_as_sf(df)
  }
  
  ggplot(data = df) +
    geom_sf(aes(fill = pval_category), color = NA, alpha = 0.9) +
    scale_fill_brewer(type = "div", palette = 6, name = "P-Value") +
    labs(title = title) +
    theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
}

moranSig.plot(localmoran, queenlist, 'Significance Map of Local Moran I')

```

The Cluster Map derived from the Local Moran’s I analysis classified census tracts in Philadelphia into  high-high, high-low, low-high, low-low, or not significant, representing different types of spatial relationships in house values within neighborhoods and their surroundings.

**High-High Clusters**: Areas classified as high-high are those where high values of median house prices are surrounded by other high-value areas. These clusters are prominently located in the northwestern parts of the city and some central regions, indicating pockets of economic affluence. The high concentration of high-value properties in these regions suggests established, affluent neighborhoods where housing prices remain high due to demand and possibly the presence of amenities or other attractive urban features.

**Low-Low Clusters**: Low-low clusters represent areas where low property values are surrounded by other low-value areas, highlighting economically disadvantaged zones. These clusters are predominantly found in the southwestern and northeastern parts of the city. The spatial clustering of low-value properties in these areas suggests neighborhoods that may face economic challenges, possibly with limited access to amenities or fewer investment opportunities. These regions may require targeted policy interventions to address underlying issues that contribute to the lower property values.

**High-Low Clusters**: High-low clusters are transitional zones where high-value areas are adjacent to lower-value areas. These areas, marked in light red on the map, are generally scattered around the boundaries of high-value neighborhoods, such as in sections of central and northwest Philadelphia. The proximity of high-value properties to lower-value ones in these clusters can indicate economic contrasts or areas experiencing gentrification, where property values in traditionally lower-income neighborhoods may be increasing due to spillover effects from nearby affluent areas.

**Low-High Clusters**: Low-high clusters, where low-value properties are surrounded by higher-value areas, are less common but appear in the northern and eastern parts of the city. These clusters suggest isolated pockets of economic disadvantage within more affluent areas. This pattern may indicate areas that are yet to benefit from surrounding economic growth or might be experiencing challenges that prevent them from aligning with the prosperity of neighboring regions.

**Not Significant Areas**: Lastly, the not significant areas, shaded in gray, indicate neighborhoods where the local Moran’s I statistic was not significant. These regions are scattered throughout the city, representing areas where house values do not exhibit strong spatial clustering. The lack of significant clustering in these areas suggests a more random distribution of house values, which may occur in more mixed-use or transitional neighborhoods where economic characteristics are varied.


```{r local moran cluster map,message=FALSE, warning=FALSE}
hl.plot <- function(df, listw) {

  local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
  quadrant <- vector(mode = 'numeric', length = nrow(df))  
  
  m.prop <- df$LNMEDHVAL - mean(df$LNMEDHVAL)
  m.local <- local[, 1] - mean(local[, 1])
  signif <- 0.05
  
  quadrant[m.prop > 0 & m.local > 0] <- 1  # high-high
  quadrant[m.prop < 0 & m.local < 0] <- 2  # low-low
  quadrant[m.prop < 0 & m.local > 0] <- 4  # low-high
  quadrant[m.prop > 0 & m.local < 0] <- 3  # high-low
  quadrant[local[, 5] > signif] <- 5  # insignificant
  
  df$quadrant <- factor(quadrant, levels = c(1, 3, 5, 2, 4), 
                        labels = c("High-High", "High-Low", "Non-Significant", "Low-Low", "Low-High"))
                          
  if (!inherits(df, "sf")) {
    df <- st_as_sf(df)
  }
  
  ggplot(data = df) +
    geom_sf(aes(fill = quadrant), color = "#848884", lwd = 0.07) +
    scale_fill_brewer(type = "div", palette = 6, name = "Cluster Type") + 
    labs(title = "Local Moran's I Cluster Map") +
    theme(legend.position="right",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
}

hl.plot(regData, queenlist)

```

## OLS Regression Results

Below, we will provide a quick walk through of the OLS regression results. Detail interpretation and dicussion of the results can be found in the [GitHub repository](https://github.com/emilyzhou112/CPLN671-OLS-Regression) for assignment 1.

All predictors in our OLS regression, which include \(\text{PCTSINGLES}\), \(\text{PCTVACANT}\), \(\text{LNNBELPOV}\), and \(\text{PCTBACHMOR}\), are statistically significant in explaining \(\text{LNMEDHVAL}\). The model accounts for approximately 66.2% of the variance, as indicated by the R-squared value of 0.662.


```{r ols regression, message=FALSE, warning=FALSE}

OLS <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData)
fitted_values <- fitted(OLS)
residuals_values <- residuals(OLS)
standardized_residuals <- rstandard(OLS)
resnb<-sapply(queen, function(x) mean(standardized_residuals[x]))
regData <- regData %>%
  mutate(
    Fitted = fitted_values,
    Residuals = residuals_values,
    Standardized_Residuals = standardized_residuals,
    Residuals_NB = resnb)

summary(OLS)

```

The results from the Breusch-Pagan, Koenker-Bassett, and White's tests all detect heteroscedasticity in the model, as all p-values are extremely low. This confirms a clear violation of the homoscedasticity assumption in OLS regression. 

```{r heteroscedasticity tests, message=FALSE, warning=FALSE}

# Breusch-Pagan Test
bptest(OLS, studentize=FALSE)

# Koenker-Bassett Test
bptest(OLS) 

# White Test
white_test(OLS)


```

Comparing to the scatter plot we made to identify heteroscedasticity from the previous assignment, the residuals in this scatter plot show no discernible trend of rising or falling variance with increasing fitted values; instead, they seem to be pretty uniformly distributed around the horizontal line at zero. Regardless, the Breusch-Pagan test is often more sensitive than a residual plot. Even small deviations that are hard to see visually may be detected.


```{r heteroscedasticity plot, message=FALSE, warning=FALSE}

ggplot(regData, aes(x = Fitted, y = Standardized_Residuals)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +    
  geom_hline(yintercept = 0, linetype = "dashed", color = "#c44536", size = 1) +   #
  labs(
    title = "Scatter Plot of Standardized Residuals vs Fitted Values",
    x = "Fitted Values",
    y = "Standardized Residuals"
  ) +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))
```

The results of the Jarque-Bera test indicate a significant deviation from normality in the residuals. The p-value is well below typical significance levels, and rejects the null hypothesis that the residuals are normally distributed. Therefore, the test results suggest that the residuals do not meet the normality assumption required for OLS regression.

```{r normality of errors, message=FALSE, warning=FALSE}

# Jarque-Bera Test 
jarque.bera.test(OLS$residuals)

```

Comparing to the histogram of standardized residuals from the previous assignment, we may also notice that the residuals are not perfectly normal. The histogram reveals a slight departure from a normal distribution and is left skewed. Despite that the skewness is not very pronounced, it is still consistent with the Jarque-Bera test results. 


```{r histogram of residuals, message=FALSE, warning=FALSE}

ggplot(regData, aes(x = Standardized_Residuals)) +
  geom_histogram(bins = 30, fill = "#283d3b", alpha = 0.9) +
  labs(title = "Histogram of Standardized Residuals", 
       x = "Standardized Residuals", 
       y = "Frequency") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))
```

The scatterplot below shows the relationship between the standardized residuals of the OLS model and the residuals of their nearest neighbors. The upward-sloping trend line indicates a positive relationship between these residuals and their neighbors, suggesting spatial autocorrelation. The coefficient for the residuals of their nearest neighbors is 0.7323, with a highly significant p-value, demonstrating a strong positive relationship. This implies that residuals are positively correlated with those of their closest neighbors, implying spatial dependence.

```{r ols residuals vs nearest neighbor residuals, message=FALSE, warning=FALSE, include=FALSE}

res.lm <- lm(Standardized_Residuals ~ Residuals_NB, data=regData)
summary(res.lm)

```


```{r ols residuals vs nearest neighbor residuals plot, message=FALSE, warning=FALSE}
ggplot(regData, aes(x = Residuals_NB, y = Standardized_Residuals)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "#c44536", size = 1) +
  labs(title = "Residuals vs. Nearest Neighbor Residuals",
       x = "Nearest Neighbor Residuals",
       y = "Standardized Residuals") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))
```

The Moran’s I scatterplot and the results from the permutations for OLS regression residuals both indicate significant spatial autocorrelation. The scatterplot shows a positive relationship between the logged median house values and their spatially lagged values, with a positive slope, confirming that high values tend to be near other high values, and low values cluster with other low values. The Moran's I statistic from the Monte Carlo simulation is approximately 0.3, with an extremely low p-value, confirming that this spatial autocorrelation is statistically significant.

This significant spatial autocorrelation in the OLS residuals is problematic because it violates the OLS assumption of independent residuals. When residuals are spatially autocorrelated, it suggests that the model is missing spatial structure in the data, which could lead to biased or inefficient estimates.


```{r moran I of OLS residuals, message=FALSE, warning=FALSE}

OLS_moranMC<-moran.mc(standardized_residuals, queenlist, nsim=999, alternative="two.sided") 
OLS_moranMC

```
Both the Moran’s I statistic and the positive beta coefficient from the regression of standardized residuals on their nearest neighbors tell a similar story. They both reveal that spatial dependence is present in the residuals, suggesting that a spatial model, such as Geographically Weighted Regression (GWR) or a spatial autoregressive model, would be more appropriate for capturing the underlying spatial relationships in the data.


```{r moran I of OLS residuals histogram and scatterplot, fig.width=10, message=FALSE, warning=FALSE}

plt1 <- ggplot(data.frame(res = OLS_moranMC$res), aes(x = res)) +
  geom_histogram(bins = 100, fill = "#283d3b") +
  geom_vline(xintercept =OLS_moranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
  labs(title = "Observed and Permuted Moran's I of OLS Residuals",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt2 <- ggplot(data = data.frame(
  residuals =standardized_residuals,
  spatial_lag = lag.listw(queenlist, standardized_residuals)
), aes(x = residuals, y = spatial_lag)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +  
  geom_smooth(method = "lm", color = "#c44536", se = FALSE) + 
  labs(title = "Moran's I Scatter Plot for OLS Residuals",
       x = "Logged Median House Value",
       y = "Spatial Lag of LNMEDHVAL") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt1 + plt2
```


## Spatial Lag and Error Regression Results

### Spatial Lag Regression

In the spatial lag regression output, we may see that the term \(\text{LNMEDHVAL}\), represented by \( \rho \) in the model, has an estimate of 0.651 and an extremely significant p-value (\( p < 2.22 \times 10^{-16} \)).This indicates that nearby values of the dependent value influence each other. The significant positive coefficient (\( \rho = 0.651 \)) suggests a substantial positive spatial lag effect, meaning that higher median home values in neighboring areas are associated with higher median home values in the area being examined.

All predictors in the spatial lag model are significant. \(\text{PCTVACANT}\) has a coefficient of -0.0085 and is highly significant since \( p < 2.22 \times 10^{-16} \). It means a higher vacancy rate is associated with lower median home values. \(\text{PCTSINGLES}\) has a coefficient of 0.0020, significant with a \( p < 0.0001 \). The positive association suggests that areas with a higher proportion of single households may slightly increase median home values. \(\text{PCTBACHMOR}\) has a coefficient of 0.0085 and is extremely significant (\( p < 2.22 \times 10^{-16} \)). Higher education levels are positively associated with home values, meaning that more educated areas tend to have higher home values. \(\text{LNNBELPOV}\) has a coefficient of -0.0341, also highly significant (\( p < 1 \times 10^{-7} \)). The negative association suggests that higher poverty levels are associated with lower median home values.

The OLS model also shows significant coefficients for all predictors, but with larger effect sizes. For example, \text{PCTVACANT} has a coefficient of -0.0192 in OLS, compared to -0.0085 in spatial lag. This suggests that OLS overestimates the influence of these predictors due to omitted spatial dependence. For the other predictors, the OLS coefficients are also larger in magnitude than the spatial lag coefficients. 

```{r spatial lag regression, message=FALSE, warning=FALSE}

SL <- lagsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData, queenlist)
summary(SL)

```

We ran Breusch-Pagan test to assess whether the residuals of the spatial lag regression model exhibit heteroscedasticity. As we may see, the test statistic for the spatial lag regression is \( BP = 211 \) with \( df = 4 \), and \( p < 2 \times 10^{-16} \). Given the extremely low p-value, we reject the null hypothesis of homoscedasticity, indicating that residuals in the spatial lag regression model remain heteroscedastic.


```{r SL heteroscedasticity tests, message=FALSE, warning=FALSE}

# Breusch-Pagan Test
bptest.Sarlm(SL, studentize=FALSE)

```

When comparing the OLS and spatial lag regression models, several metrics show the spatial lag model provides a better fit. The spatial lag model has a significantly lower AIC (525) than OLS (1435), indicating a better model fit by penalizing less for added complexity. Similarly, the Schwarz Criterion is much lower for the spatial lag model, suggesting it captures the structure of the data more effectively than the OLS model. The spatial lag model (-256) has a higher log likelihood than OLS (-711), meaning that it better explains the variability in the data.

```{r compare OLS and SL, message=FALSE, warning=FALSE}

# Akaike Information Criterion
aic_ols <- AIC(OLS)
aic_sl <- AIC(SL)

# Schwarz Criterion
bic_ols <- BIC(OLS)
bic_sl <- BIC(SL)

# The Log Likelihood
loglik_ols <- logLik(OLS)
loglik_sl <- logLik(SL)

results <- data.frame(
  Model = c("OLS Regression", "Spatial Lag Regression"),
  AIC = c(aic_ols, aic_sl),
  SchwarzCriterion = c(bic_ols, bic_sl),
  LogLikelihood = c(loglik_ols, loglik_sl)
)

results %>%
  kable(row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 


```

The Likelihood Ratio Test assesses the improvement in model fit between the OLS and spatial lag models, and the result we get here shows \( LR = 912 \) with \( df = 1 \), and \( p < 2 \times 10^{-16} \). With a large test statistic and low p-value, we conclude that the spatial lag model significantly improves fit over the OLS model. 

```{r log likelihood ratio test for SL, message=FALSE, warning=FALSE}

# The Likelihood Ratio Test
lr_test <- LR.Sarlm(SL, OLS)
lr_test

```

Finally, we examined Moran's I for the spatial lag regression model's residuals.The observed Moran’s I of -0.08 suggests a small negative spatial autocorrelation in the residuals of the spatial lag model. With \( p = 0.002 \), we reject the null hypothesis of no spatial autocorrelation. This negative Moran’s I indicates that residuals are distributed with slight spatial dispersion rather than clustering, suggesting the spatial lag model effectively accounts for most spatial dependencies in the data.

```{r moran I of SL residuals, message=FALSE, warning=FALSE}

SL_moranMc<-moran.mc(SL$residuals, queenlist,999, alternative="two.sided")
SL_moranMc

```

Both the residual histograms as well as the scatter plot support this conclusion. The histogram of residuals is also approximately normally distributed, and the scatter plot of residuals against fitted values shows no clear pattern or trend. All of these suggest that the spatial lag model is a good fit for the data compared to OLS. 

```{r moran I of SL residuals histogram and scatterplot, fig.width=10, message=FALSE, warning=FALSE}

plt3 <- ggplot(data.frame(res = SL$residuals), aes(x = res)) +
  geom_histogram(bins = 100, fill = "#283d3b") +
  geom_vline(xintercept =SL_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
  labs(title = "Observed and Permuted Moran's I of SL Residuals",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt4 <- ggplot(data = data.frame(
  residuals =SL$residuals,
  spatial_lag = lag.listw(queenlist, SL$residuals)
), aes(x = residuals, y = spatial_lag)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +  
  geom_smooth(method = "lm", color = "#c44536", se = FALSE) + 
  labs(title = "Moran's I Scatter Plot for SL Residuals",
       x = "Logged Median House Value",
       y = "Spatial Lag of LNMEDHVAL") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt3 + plt4
```


### Spatial Error Regression

In the spatial error regression model, the term \(\lambda\) represents the spatial autocorrelation within the model’s error terms. In this case, \(\lambda\) is estimated at 0.815 and is statistically significant, with \( p < 2.22 \times 10^{-16} \). This highly significant and large value of \(\lambda\) suggests a strong spatial dependency in the error terms. 

Examining the individual predictors in the model, all variables are statistically significant at a high confidence level. \(\text{PCTVACANT}\) has an estimate of -0.0058 (\( p < 6.94 \times 10^{-9} \)), indicating that vacancy rates negatively affect median home values. \(\text{PCTSINGLES}\) has a positive coefficient of 0.0027 (\( p < 1.61 \times 10^{-5} \)), suggesting a small positive association with median home values. \(\text{PCTBACHMOR}\) , representing the percentage of bachelor’s degrees or higher, shows a positive effect with an estimate of 0.0098 (\( p < 2.20 \times 10^{-16} \)), indicating that higher education levels correlate with higher home values. \(\text{LNNBELPOV}\) has a negative estimate of -0.0345 and \( p < 1.11 \times 10^{-6} \), implying that higher levels of neighborhood poverty are associated with lower median home values.

Similar to spatial lag regression, when comparing these results with the OLS regression model, we observe that the coefficients for our predictors are all smaller in magnitude than their OLS counterparts. For instance, \text{PCTVACANT} had a coefficient of -0.019156 in the OLS model, while in the spatial error model, it is reduced to -0.0058. This reduction in coefficient magnitude suggests that the spatial error model provides a more conservative estimate of the effects of these predictors on median home values, likely due to the accounted spatial autocorrelation.


```{r spatial error regression, message=FALSE, warning=FALSE}

SE <- errorsarlm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, data=regData, queenlist)
summary(SE)

```

Based on the results of the Breusch-Pagan test, the spatial lag regression residuals are indeed heteroscedastic. The test statistic is \( BP = 23 \) with \( df = 4 \), and \( p < 0.0001 \). Since the p-value is significantly lower than the conventional alpha levels (0.01, 0.05), we reject the null hypothesis of homoscedasticity. This indicates that the variance of the residuals is not constant across observations, suggesting the presence of heteroscedasticity in the model's residuals.

```{r SE heteroscedasticity tests, message=FALSE, warning=FALSE}

# Breusch-Pagan Test 
bptest.Sarlm(SE, studentize=FALSE)

```

We also compared the spatial error model to the OLS model using the AIC, SC, log-likelihood, and likelihood ratio test. The spatial error model has a lower AIC (759) and SC (798) than the OLS model (1435 and 1468, respectively), indicating a better fit. The log-likelihood of the spatial error model (-373) is also higher than that of the OLS model (-711), suggesting that the spatial error model better explains the variability in the data.

```{r compare OLS SL and SE, message=FALSE, warning=FALSE}

# Akaike Information Criterion
aic_se <- AIC(SE)

# Schwarz Criterion
bic_se <- BIC(SE)

# The Log Likelihood
loglik_se <- logLik(SE)

results <- data.frame(
  Model = c("OLS Regression", "Spatial Lag Regression", "Spatial Error Regression"),
  AIC = c(aic_ols, aic_sl, aic_se),
  SchwarzCriterion = c(bic_ols, bic_sl, bic_se),
  LogLikelihood = c(loglik_ols, loglik_sl, loglik_se)
)

results %>%
  kable(row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 


```

According to the Likelihood Ratio Test, we see \( LR = 678 \) with \( df = 1 \), and \( p < 2 \times 10^{-16} \). With a large test statistic and low p-value, we may also conclude that the spatial error model significantly improves fit over the OLS model. 

```{r log likelihood ratio test for SE, message=FALSE, warning=FALSE}

lr_test <- LR.Sarlm(SE, OLS)
lr_test

```

Finally, we examined Moran's I for the spatial error regression model's residuals. The observed Moran’s I of -0.09 suggests a small negative spatial autocorrelation in the residuals of the spatial error model. 
With \( p = 0.002 \), we reject the null hypothesis of no spatial autocorrelation. This negative Moran’s I indicates that residuals are distributed with slight spatial dispersion rather than clustering, suggesting the spatial error model effectively accounts for most spatial dependencies in the data.


```{r moran I of SE residuals, message=FALSE, warning=FALSE}

SE_moranMc<-moran.mc(residuals(SE), queenlist,999, alternative="two.sided")
SE_moranMc

```

Both the residual histograms as well as the scatter plot support this conclusion here again. The histogram of residuals is approximately normally distributed, and the scatter plot of residuals against fitted values shows a much less evident trend compared to the OLS model. Therefore, all of these suggest that the spatial error model is a good fit for the data compared to OLS.

```{r moran I of SE residuals histogram and scatterplot, fig.width=10, message=FALSE, warning=FALSE}

plt5 <- ggplot(data.frame(res = residuals(SE)), aes(x = res)) +
  geom_histogram(bins = 100, fill = "#283d3b") +
  geom_vline(xintercept =SE_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
  labs(title = "Observed and Permuted Moran's I of SE Residuals",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt6 <- ggplot(data = data.frame(
  residuals = residuals(SE),
  spatial_lag = lag.listw(queenlist, residuals(SE))
), aes(x = residuals, y = spatial_lag)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +  
  geom_smooth(method = "lm", color = "#c44536", se = FALSE) + 
  labs(title = "Moran's I Scatter Plot for SE Residuals",
       x = "Logged Median House Value",
       y = "Spatial Lag of LNMEDHVAL") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt5 + plt6
```

It should be noted that here, we also compared spatial lag model with the spatial error model. The spatial lag model has a lower AIC (525) and SC (564) than the spatial error model (759 and 798, respectively), indicating a better fit. The log-likelihood of the spatial lag model (-256) is also less negative than that of the spatial lag error (-373), suggesting that the spatial lag model better explains the variability in the data. 

## Geographically Weighted Regression Results

```{r covert file structure, message=FALSE, warning=FALSE, include=FALSE}

regDatashps <- as(regData, 'Spatial')  

```


```{r gwr, message=FALSE, warning=FALSE, eval=FALSE, include=FALSE}

bw<-gwr.sel(formula=LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV, 
            data=regDatashps,
            method = "aic",
            adapt = TRUE)
gwrmodel<-gwr(formula=LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV,
              data=regDatashps,
              adapt = bw, # adaptive bandwidth
              gweight=gwr.Gauss,
              se.fit=TRUE, 
              hatmatrix = TRUE)
```


```{r load processed data, message=FALSE, warning=FALSE, include=FALSE}

bw <- readRDS(here("data", "bandwidth_gwr.rds"))
gwrmodel <- readRDS(here("data", "gwrmodel.rds"))
```

The \(R^2\) for the OLS regression model is 0.662, while the \(R^2\) (Quasi-global \(R^2\) ) for the GWR model is 0.848. The higher \(R^2\) value of the GWR model indicates that it does a better job of explaining the variance in the dependent variable \(\text{LNMEDHVAL}\) compared to the OLS model. This improvement suggests that the spatial variability captured by the GWR model provides a more accurate representation of the relationships between the predictors and the dependent variable, as GWR can account for spatial heterogeneity that OLS cannot.

Based on the Akaike Information Criterion (AIC) values, the GWR model, with an AIC of 309, demonstrates the best fit among the models compared. The Spatial Lag model has a higher AIC of 525, followed by the Spatial Error model at 759, and the OLS regression model at 1435. Since a lower AIC indicates a better fit, the GWR model appears to be the most effective in explaining the variance in the data. This suggests that GWR, which accounts for spatial heterogeneity by allowing model coefficients to vary across locations, provides a more accurate and robust fit for this dataset than the other models.


```{r present gwr results, message=FALSE, warning=FALSE}
gwrmodel
```

The choropleth map of local \( R^2 \) values from the GWR model illustrates how the model explains variations in logged median house values across the city. Darker-colored areas, such as the northwestern, parts of the northeastern, and far western regions, indicate a stronger model fit, where the predictors in the model explain a larger proportion of the variability in house values. In contrast, lighter-colored areas, particularly in the northern and western parts of the city, show a weaker model fit, suggesting that the predictors are less effective in capturing the variability in these regions. 

```{r local R-squares choropleth, message=FALSE, warning=FALSE}

gwrresults<-as.data.frame(gwrmodel$SDF)
regDatashps$localR2<-gwrresults$localR2
regDatashps_sf <- st_as_sf(regDatashps)
ggplot(data = regDatashps_sf) +
  geom_sf(aes(fill = localR2), color = NA) +  
  scale_fill_gradientn(colors = c("#FAF9F6", "#c44536"), 
                       name = "Local R-Squared", 
                       na.value = "transparent")+
  labs(title = "Local R-Squared from GWR", x = "", y = "") + 
   theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

```

The analysis indicates that the GWR model has successfully reduced spatial autocorrelation in the residuals compared to the OLS model. In the distribution of Moran’s I values for GWR residuals, the observed Moran’s I is close to zero, suggesting minimal spatial autocorrelation and aligning with the assumption of uncorrelated residuals. In contrast, the Moran’s I for the OLS residuals is both positive and statistically significant, indicating a stronger degree of spatial dependence. 

The spatial lag model also reduces spatial autocorrelation in the residuals, although not as effectively as the GWR model. The spatial error model performs similarly, but the residuals exhibit slightly more spatial autocorrelation than those of the GWR model. Overall, these results suggest that the GWR model provides the best fit in terms of minimizing spatial autocorrelation, capturing local spatial variations more precisely than the spatial lag and spatial error models. This indicates that GWR is particularly effective for modeling spatial heterogeneity in this dataset.

```{r moran I of gwr residuals, message=FALSE, warning=FALSE}

GWR_moranMc<-moran.mc(gwrresults$gwr.e, queenlist, 999, alternative="two.sided")
GWR_moranMc

```

The Moran's I scatter plot for the GWR residuals shows minimal spatial autocorrelation, as indicated by the nearly horizontal trend of the line, suggesting that the GWR model has successfully addressed much of the spatial dependency present in the data. This is consistent with the earlier histogram results, where the Moran's I value for GWR residuals was close to zero, further affirming that the model effectively captures local spatial variations in logged median house values. 

```{r moran I of gwr residuals histogram and scatterplot, fig.width=10, message=FALSE, warning=FALSE}

plt7 <- ggplot(data.frame(res = gwrresults$gwr.e), aes(x = res)) +
  geom_histogram(bins = 100, fill = "#283d3b") +
  geom_vline(xintercept =GWR_moranMc$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
  labs(title = "Observed and Permuted Moran's I of GWR Residuals",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt8 <- ggplot(data = data.frame(
  residuals = gwrresults$gwr.e,
  spatial_lag = lag.listw(queenlist, gwrresults$gwr.e)
), aes(x = residuals, y = spatial_lag)) +
  geom_point(color = "#283d3b", alpha = 0.9, size = 0.6) +  
  geom_smooth(method = "lm", color = "#c44536", se = FALSE) + 
  labs(title = "Moran's I Scatter Plot for GWR Residuals",
       x = "Logged Median House Value",
       y = "Spatial Lag of LNMEDHVAL") +
  theme_light() +   
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8))

plt7 + plt8
```

Looking at each predictors in detail, the vacancy rate generally has a negative impact on housing values, with significant negative effects observed in Center City, northwestern areas, parts of the northeastern region, and the western part of the city. However, there is a localized area in southern Center City where vacancy rates unexpectedly have a positive impact on housing values.

Single-family housing, in general, positively impacts housing values, particularly in the far northeast and far northwest areas, where this effect is more pronounced. Conversely, in some parts of the city—such as eastern Philadelphia, parts of the west, and sections of the south—the impact of single-family housing is negative and statistically significant.

The percentage of residents with a bachelor’s degree generally has a positive impact on housing values, with no areas showing a negative relationship across the city. This positive influence is especially statistically significant in the northern and northeastern parts of the city.

Finally, the poverty rate generally has a negative effect on housing values. This effect is more statistically significant in the southern, eastern, and northwestern parts of the city. 

Overall, these spatial variations highlight how different factors affect housing values uniquely across Philadelphia, emphasizing the importance of localized analysis in understanding property dynamics.

```{r ratio of coefficients to standard error, message=FALSE, warning=FALSE, include=FALSE}

regDatashps_sf$coefPCTVACANTst<-gwrresults$PCTVACANT/gwrresults$PCTVACANT_se
regDatashps_sf$coefPCTSINGLESst<-gwrresults$PCTSINGLES/gwrresults$PCTSINGLES_se
regDatashps_sf$coefPCTBACHMORst<-gwrresults$PCTBACHMOR/gwrresults$PCTBACHMOR_se
regDatashps_sf$coefLNNBELPOVst<-gwrresults$LNNBELPOV/gwrresults$LNNBELPOV_se

```


```{r ratio of coefficients to standard error plot, message=FALSE, warning=FALSE, fig.width=10, fig.height=10}

regDatashps_sf$coefPCTVACANTst_cat <- cut(regDatashps_sf$coefPCTVACANTst, 
                                   breaks = c(-Inf, -2, 0, 2, Inf), 
                                   labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))
regDatashps_sf$coefPCTSINGLESst_cat <- cut(regDatashps_sf$coefPCTSINGLESst, 
                                   breaks = c(-Inf, -2, 0, 2, Inf), 
                                   labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))

regDatashps_sf$coefPCTBACHMORst_cat <- cut(regDatashps_sf$coefPCTBACHMORst,
                                   breaks = c(-Inf, -2, 0, 2, Inf), 
                                   labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))

regDatashps_sf$coefLNNBELPOVst_cat <- cut(regDatashps_sf$coefLNNBELPOVst,
                                   breaks = c(-Inf, -2, 0, 2, Inf), 
                                   labels = c("< -2", "-2 to 0", "0 to 2", "> 2"))


p1 <- ggplot(regDatashps_sf) +
    geom_sf(aes(fill = coefPCTVACANTst_cat), color = NA) +  
    scale_fill_manual(
      values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"), 
      name = "Ratio Category"  
    ) +
   labs(title = "PCTVACANT Coefficient to Standard Error",
       x = "", y = "") +
   theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

p2 <- ggplot(regDatashps_sf) +
    geom_sf(aes(fill = coefPCTSINGLESst_cat), color = NA) +  
    scale_fill_manual(
      values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"), 
      name = "Ratio Category"  
    ) +
   labs(title = "PCTSINGLES Coefficient to Standard Error",
       x = "", y = "") +
   theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))


p3 <- ggplot(regDatashps_sf) +
    geom_sf(aes(fill = coefPCTBACHMORst_cat), color = NA) +  
    scale_fill_manual(
      values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"), 
      name = "Ratio Category"  
    ) +
   labs(title = "PCTBACHMOR Coefficient to Standard Error",
       x = "", y = "") +
   theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

p4 <- ggplot(regDatashps_sf) +
    geom_sf(aes(fill = coefLNNBELPOVst_cat), color = NA) +  
    scale_fill_manual(
      values = c("#c44536", "#f1b1a6", "#8fa7a5", "#283d3b"), 
      name = "Ratio Category"  
    ) +
   labs(title = "LNNBELPOV Coefficient to Standard Error",
       x = "", y = "") +
   theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

p1 + p2 + p3 + p4 + 
  plot_layout(ncol = 2)
```

# Discussion

In this report, we conducted a comparative analysis of four regression methods—Ordinary Least Squares (OLS), Spatial Lag, Spatial Error, and Geographically Weighted Regression (GWR)—to evaluate their effectiveness in explaining the variance in the dependent variable, \(\text{LNMEDHVAL}\). Our findings indicate that the GWR model exhibited the highest explanatory power, with a quasi-global \(R^2\) of 0.84, compared to the OLS \(R^2\) of 0.662, suggesting that GWR is better suited to capture local variations in the data.

The Akaike Information Criterion (AIC) further supports the suitability of GWR, which achieved an AIC of 309, in contrast to the OLS AIC of 1435, the Spatial Lag AIC of 525, and the Spatial Error AIC of 759. Nevertheless, the lower AIC values in spatial lag and spatial error regression still indicate that both models provide are much better fit than the OLS model, especially the spatial lag model. Another observation is the reduction in the coefficient magnitudes for the predictors in the spatial lag and spatial error models compared to OLS, suggesting that they provide more conservative estimates of the predictors' effects on median home values for accounting spatial autocorrelation. Yet, overall, our analysis demonstrates that the GWR method is the most effective for this dataset. 

While GWR offers several advantages, such as capturing spatial heterogeneity and providing localized insights, it also has limitations, which is why a lot of spatial statisticians use it only for data exploration rather than for modeling. Most notably, a lot of assumptions we have in OLS still hold in GWR, such as linearity, homoscedasticity, and normality of errors. As an example, from our previous assignment, we know that there's multicollinearity between predictors - higher education attainment is clearly associated with lower poverty level. As such, our current GWR model clearly does not account for that. We also know that some of our predictor variables, such as \(\text{PCTSINGLES}\) and \(\text{PCTBACHMOR}\) are not normally distributed, which violates the assumption. 

Other than that, GWR requires at least 300 observations and user's choice of bandwidth can significantly affect results. In our example we have way more than 300 observations and R automatically picked the bandwidths for us, still we should be cautious about the results.

We also need to be careful about the the use of term between **spatially lagged residuals** and **spatial lag model residuals**. The former term, **spatially lagged residuals**, refers to a series of residuals that take into account the influence of nearby observations through a weighting scheme. The latter term, **spatial lag model residuals** refers to the error terms from the result of a model that incorporates a spatial lag term (which is the values of the dependent variable from neighboring observations). 

And lastly, we would like to highlight some limitations of using ArcGIS for GWR. The major issues is that ArcGIS does not always allow for the same level of flexibility in selecting bandwidth methods or kernel functions as R, and it has a very different and very confusing way of optimizing bandwidth. In addition, ArcGIS Pro does not give AIC, just AICc, which makes it difficult to compare GWR output to other models. More importantly, in the current and earlier versions of ArcGIS Pro, some local R-squares are negative, which is not possible in a regression model. Therefore, we recommend using R for GWR analysis to avoid these limitations and to have more control over the model selection process.





# Reference

Anselin, L. (1988). Spatial econometrics: Methods and models. Kluwer Academic Publishers.

Atkinson, R. (2004). The evidence on the impact of gentrification: New lessons for the urban renaissance? European Journal of Housing Policy, 4(1), 107–131. https://doi.org/10.1080/1461671042000215479

Brunsdon, C., Fotheringham, A. S., & Charlton, M. E. (1996). Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298. https://doi.org/10.1111/j.1538-4632.1996.tb00936.x

Freeman, L., & Braconi, F. (2004). Gentrification and displacement. Journal of the American Planning Association, 70(1), 39–52. https://doi.org/10.1080/01944360408976337

Galster, G. C. (2008). Quantifying the effect of neighborhood land use and zoning on housing prices. Regional Science and Urban Economics, 38(3), 291–305. https://doi.org/10.1016/j.regsciurbeco.2008.03.003

Glaeser, E. L., & Gyourko, J. (2018). Rethinking federal housing policy: How to make housing plentiful and affordable. American Enterprise Institute.

Lees, L. (2008). Gentrification and social mixing: Towards an inclusive urban renaissance? Urban Studies, 45(12), 2449–2470. https://doi.org/10.1177/0042098008097099

LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. CRC Press.

Mallach, A. (2018). The divided city: Poverty and prosperity in urban America. Island Press. https://doi.org/10.5822/978-1-61091-781-0

Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240. https://doi.org/10.2307/143141
